IMAGE PROCESSING SYSTEM

The present invention relates to the parametric modelling 
of the appearance of objects. The resulting model can be 
used, for example, to track the object, such as a human 
face, in a video sequence. 

The use of parametric models for image interpretation and
synthesis has become increasingly popular. Cootes et al
have shown in their paper entitled "Active Shape Models -
Their Training and Application", Computer Vision and
Image Understanding, Volume 61, No. 1, January, pages
38-59, 1995, how such parametric models can be used to model
the variability of the shape and texture of human faces.
They have mainly used these models for face recognition 
and tracking within video sequences, although they have 
also demonstrated that their model can be used to model 
the variability of other deformable objects, such as MRI 
scans of knee joints. The use of these models provides 
a basis for a broad range of applications since they 
explain the appearance of a given image in terms of a 
compact set of model parameters which can be used for 
higher levels of interpretation of the image. For 
example, when analysing face images, they can be used to 
characterise the identity, pose or expression of a face. 

Using such models for image interpretation requires,
however, a method of fitting them to new image data.
This involves identifying the model parameters that
generate an image which best fits (according to some
measure) the new input image. Typically this problem is
one of minimising the sum of squares of pixel errors
between the generated image and the input image. In
their paper entitled "Estimating Coloured 3D Face Models
from Single Images: An Example-Based Approach", Vetter and
Blanz have proposed a stochastic gradient descent
optimisation technique to identify the optimum model
parameters for the new image. Although this technique
can give very accurate results when finding a locally
optimal solution, it generally gets stuck in local
minima, since the error surface for the problem of fitting
an appearance model to an image is particularly rough and
contains many local minima. Therefore, this
minimisation technique often fails to converge on the
global minimum. An additional drawback of this technique
is that it is very slow, requiring several minutes to
achieve convergence.

A faster, more robust technique known as the active 
appearance model was proposed by Edwards et al in the 
paper entitled "Interpreting Face Images using Active 
Appearance Models", published in the Third International 
Conference on Automatic Face and Gesture Recognition 
1998, pages 300-305, Japan, April 1998. This technique 
uses a prior training stage in which the relationship
between model parameter displacements and the resulting
change in image error is learnt. Although the method is
much faster than direct optimisation techniques, it also
requires fairly accurate initial model parameters if the
search is to converge. Additionally, this technique does
not guarantee that the optimum parameters will be found.



The appearance model proposed by Cootes et al includes a
single appearance model matrix which linearly relates a
set of parameters to corresponding image data. Blanz et
al segmented the face into a number of completely
independent appearance models, each of which is used to
render a separate region of the face. The results are
then merged using a general interpretation technique.

The present invention aims to provide an alternative way
of modelling the appearance of objects which will allow
subsequent image interpretation through appropriate
processing of parameters generated for the image.



According to one aspect, the present invention provides
a hierarchical parametric model for modelling the shape
of an object, the model comprising data defining a
hierarchical set of functions in which a function in a
top layer of the hierarchy is operable to generate a set
of output parameters from a set of input parameters and
in which one or more functions in a bottom layer of the
hierarchy are operable to receive parameters output from
one or more functions from a higher layer of the
hierarchy and to generate therefrom the relative
positions of a plurality of predetermined points on the
object. Such a hierarchical parametric model has the
advantage that small changes in some parts of the object
can still be modelled by the parameters, even though they
are significantly smaller than variations in other less
important parts of the object. This model can be used
for face tracking, video compression, 2D and 3D character
generation, face recognition for security purposes, image
editing etc.

According to another aspect, the present invention
provides an apparatus and method of determining a set of
appearance parameters representative of the appearance of
an object, the method comprising the steps of storing a
hierarchical parametric model such as the one discussed
above and at least one function which relates a change in
input parameters to an error between actual appearance
data for the object and appearance data determined from
the set of input parameters and the parametric model;
initially receiving a current set of input parameters for
the object; determining appearance data for the object
from the current set of input parameters and the stored
parametric model; determining the error between the
actual appearance data of the object and the appearance
data determined from the current set of input parameters;
determining a change in the input parameters using the at
least one stored function and said determined error; and
updating the current set of input parameters with the
determined change in the input parameters.
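
The update steps recited in this aspect can be illustrated
with a short sketch. It is only a toy illustration under
stated assumptions: the model is a simple linear function,
the stored function is a single matrix (here called UPDATE)
applied to the error vector, and the steps are repeated a
fixed number of times. The names fit_parameters, toy_model,
BASIS and UPDATE are hypothetical and do not appear in this
specification.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def fit_parameters(actual, model, update_matrix, params, iterations=10):
    """Refine `params` so that model(params) approaches `actual`."""
    for _ in range(iterations):
        generated = model(params)                          # appearance data from current parameters
        error = [a - g for a, g in zip(actual, generated)]  # actual minus generated
        change = mat_vec(update_matrix, error)             # stored function applied to the error
        params = [p + d for p, d in zip(params, change)]   # update the current parameters
    return params


# Assumed toy model: 2 parameters generate 3 "pixel" values.
BASIS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def toy_model(params):
    return mat_vec(BASIS, params)


# Assumed stored function: half of the pseudo-inverse of BASIS, so that
# each iteration removes half of the remaining parameter error.
UPDATE = [[2.0 / 6.0, -1.0 / 6.0, 1.0 / 6.0],
          [-1.0 / 6.0, 2.0 / 6.0, 1.0 / 6.0]]

target = toy_model([0.8, -0.3])
fitted = fit_parameters(target, toy_model, UPDATE, [0.0, 0.0], iterations=20)
```

In this toy case the fitted parameters approach the values
used to generate the target data, because the assumed update
matrix is derived from the model itself. In practice the
relationship would be learnt in a training stage.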

An exemplary embodiment of the present invention will now
be described with reference to the accompanying drawings
in which:



Figure 1 is a schematic block diagram illustrating a
general arrangement of a computer system which can be
programmed to implement the present invention;

Figure 2 is a block diagram of an appearance model
generation unit which receives some of the image frames
of a source video sequence together with a target image
frame and generates therefrom an appearance model;



Figure 3 is a block diagram of a target video sequence
generation unit which generates a target video sequence
from a source video sequence using a set of stored
difference parameters;

Figure 4 is a flow chart illustrating the processing
steps which the target video sequence generation unit
shown in Figure 3 performs to generate the target video
sequence;

Figure 5 schematically illustrates the form of a
hierarchical appearance model generated in one embodiment
of the invention;

Figure 6 shows a head with a mesh of triangular facets
placed over the head and whose positions are defined by
the position of landmark points at the corners of the
facets;



Figure 7 is a flow chart illustrating the processing
steps required to generate a facet appearance model from
the training images;

Figure 8 schematically illustrates the way in which a
transformation is defined between a facet in a training
image and a predefined shape of facet which allows
texture information to be extracted from the facet;

Figure 9 is a flow chart illustrating the main processing
steps involved in determining an appearance model for the
mouth using the appearance models for the facets which
appear in the mouth and using the training images;

Figure 10 schematically illustrates the way in which
training images are used to determine some of the
appearance models which form the hierarchical appearance
model illustrated in Figure 5;

Figure 11a is a flow chart illustrating the processing
steps performed during a training routine to identify an
Active matrix associated with a current facet;

Figure 11b is a flow chart illustrating the processing
steps performed during a training routine to identify an
Active matrix associated with the mouth;

Figure 12 is a flow chart illustrating the processing
steps involved in determining a set of parameters which
define the appearance of a face within an input image;

Figure 13a shows three frames of an example source video 
sequence which is applied to the target video sequence 
generation unit shown in Figure 4; 

Figure 13b shows an example target image used to generate 
a set of difference parameters used by the target video 
sequence generation unit shown in Figure 4; 

Figure 13c shows a corresponding three frames from a 
target video sequence generated by the target video 
sequence generation unit shown in Figure 4 from the three 
frames of the source video sequence shown in Figure 13a 
using the difference parameters generated using the 
target image shown in Figure 13b; 



Figure 13d shows a second example of a target image used
to generate a set of difference parameters for use by the
target video sequence generation unit shown in Figure 4;
and

Figure 13e shows the corresponding three frames from the
target video sequence generated by the target video
sequence generation unit shown in Figure 4 when the three
frames of the source video sequence shown in Figure 13a
are input to the target video sequence generation unit
together with the difference parameters calculated using
the target image shown in Figure 13d.


Figure 1 shows an image processing apparatus according to an
embodiment of the present invention. The apparatus
comprises a computer 1 having a central processing unit
(CPU) 3 connected to a memory 5 which is operable to
store a program defining the sequence of operations of
the CPU 3 and to store object and image data used in
calculations by the CPU 3. Coupled to an input port of
the CPU 3 there is an input device 7, which in this
embodiment comprises a keyboard and a computer mouse.
Instead of, or in addition to, the computer mouse, another
position sensitive input device (pointing device) such as
a digitiser with associated stylus may be used.

A frame buffer 9 is also provided and is coupled to the
CPU 3 and comprises a memory unit (not shown) arranged to
store image data relating to at least one image, for
example by providing one (or several) memory location(s)
per pixel of the image. The value stored in the frame
buffer for each pixel defines the colour or intensity of
that pixel in the image. In this embodiment, the images
are represented by 2-D arrays of pixels, and are
conveniently described in terms of Cartesian coordinates,
so that the position of a given pixel can be described by
a pair of x-y coordinates. This representation is
convenient since the image is displayed on a raster scan
display 11. Therefore, the x-coordinate maps to the
distance along the line of the display and the
y-coordinate maps to the number of the line. The frame
buffer 9 has sufficient memory capacity to store at least
one image. For example, for an image having a resolution
of 1000 x 1000 pixels, the frame buffer 9 includes 10^6
pixel locations, each addressable directly or indirectly
in terms of a pixel coordinate x,y.
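
By way of illustration only, the addressing of a pixel
location from its x,y coordinate might be sketched as
follows. The row-major layout and the names pixel_index and
frame_buffer are assumptions made for this sketch; the
description above does not prescribe a particular memory
layout.

```python
WIDTH, HEIGHT = 1000, 1000


def pixel_index(x, y, width=WIDTH):
    """Row-major address: y selects the display line, x the distance along it."""
    return y * width + x


# One location per pixel gives 10**6 locations for a 1000 x 1000 image.
frame_buffer = [0] * (WIDTH * HEIGHT)

# Store an intensity value for the pixel at x=3 on line y=2.
frame_buffer[pixel_index(3, 2)] = 255
```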

In this embodiment, a video tape recorder (VTR) 13 is
also coupled to the frame buffer 9, for recording the
image or sequence of images displayed on the display 11.
A mass storage device 15, such as a hard disc drive,
having a high data storage capacity is also provided and
coupled to the memory 5. Also coupled to the memory 5 is
a floppy disc drive 17 which is operable to accept
removable data storage media, such as a floppy disc 19,
and to transfer data stored thereon to the memory 5. The
memory 5 is also coupled to a printer 21 so that
generated images can be output in paper form, an image
input device 23 such as a scanner or video camera, and a
modem 25 so that input images and output images can be
received from and transmitted to remote computer
terminals via a data network, such as the Internet. The
CPU 3, memory 5, frame buffer 9, display unit 11 and mass
storage device 15 may be commercially available as a
complete system, for example as an IBM compatible
personal computer (PC) or a workstation such as the Sparc
station available from Sun Microsystems.

A number of embodiments of the invention can be supplied
commercially in the form of programs stored on a floppy
disc 19 or on other media, or as signals transmitted
over a data link, such as the Internet, so that the
receiving hardware becomes reconfigured into an apparatus
embodying the present invention.



In this embodiment, the computer 1 is programmed to
receive a source video sequence input by the image input
device 23 and to generate a target video sequence from
the source video sequence using a target image. In this
embodiment, the source video sequence is a video clip of
an actor acting out a scene, the target image is an image
of a second actor and the resulting target video sequence
is a video sequence showing the second actor acting out
the scene. The way in which this is achieved will now be
briefly described with reference to Figures 2 to 4.

In this embodiment, in order to generate the target video
sequence from the source video sequence, a hierarchical
parametric appearance model which models the variability
of shape and texture of the head images is used. This
appearance model makes use of the fact that some prior
knowledge is available about the contents of head images
in order to facilitate their modelling. For example, it
can be assumed that two frontal images of a human face
will each include eyes, a nose and a mouth. In this
embodiment, as shown in Figure 2, the hierarchical
parametric appearance model 35 is generated by an
appearance model generation unit 31 from training images
which are stored in an image database 32. In this
embodiment, all the training images are colour images
having 500 x 500 pixels, with each pixel having a red,
green and a blue pixel value. The resulting appearance
model 35 is a parameterisation of the appearance of the
class of head images defined by the heads in the training
images, so that a relatively small number of parameters
(for example 50) can describe the detailed (pixel level)
appearance of a head image from the class. In
particular, the hierarchical appearance model 35 defines
a function (F) such that:

    $I = F(\mathbf{p})$    (1)

where $\mathbf{p}$ is the set of appearance parameters
(written in vector notation) which generates, through the
hierarchical appearance model (F), the face image I. The
structure of the hierarchical appearance model used in
this embodiment will be described later.



Once the hierarchical appearance model 35 has been
determined, a target video sequence can be generated from
a source video sequence. As shown in Figure 3, the
source video sequence is input to a target video sequence
generation unit 51 which processes the source video
sequence using a set of difference parameters 53 to
generate and to output the target video sequence. The
difference parameters 53 are determined by subtracting
the appearance parameters which are generated for the
first actor's head in one of the source video frames,
from the appearance parameters which are generated for
the second actor's head in the target image. The way in
which these appearance parameters are determined for
these images will be described later. In order that
these difference parameters only represent differences in
the general shape and colour texture of the two actors'
heads, the pose and facial expression of the first
actor's head in the source video frame used should match,
as closely as possible, the pose and facial expression of
the second actor's head in the target image.



The processing steps required to generate the target
video sequence from the source video sequence will now be
described in more detail with reference to Figure 4. As
shown, in step s1, the appearance parameters ($\mathbf{p}_s^i$) for
the first actor's head in the current video frame ($I_s^i$)
are automatically calculated. The way that this is
achieved will be described later. Then, in step s3, the
difference parameters ($\mathbf{p}_{dif}$) are added to the appearance
parameters for the first actor's head in the current
video frame to generate:

    $\mathbf{p}_{mod}^i = \mathbf{p}_s^i + \mathbf{p}_{dif}$    (2)

The resulting appearance parameters ($\mathbf{p}_{mod}^i$) are then used,
in step s5, to regenerate the head for the current target
video frame. In particular, the modified appearance
parameters are inserted into equation (1) above to
regenerate a modified head image which is then
composited, in step s7, into the source video frame to
generate the corresponding target video frame. A check
is then made, in step s9, to determine whether or not
there are any more source video frames. If there are,
then the processing returns to step s1 where the
procedure described above is repeated for the next source
video frame. If there are no more source video frames,
then the processing ends.
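
The processing of steps s1 to s9 can be summarised by the
following sketch. It is illustrative only: the functions
estimate_params, render_head and composite are hypothetical
placeholders for the parameter determination, the model of
equation (1) and the compositing operation respectively, and
are passed in as arguments because their implementation is
described elsewhere.

```python
def difference_parameters(target_params, source_params):
    """Subtract the source-frame parameters from the target-image parameters."""
    return [t - s for t, s in zip(target_params, source_params)]


def generate_target_sequence(source_frames, p_dif,
                             estimate_params, render_head, composite):
    """Apply steps s1 to s9 to every frame of the source sequence."""
    target_frames = []
    for frame in source_frames:                       # s9: loop until no frames remain
        p_s = estimate_params(frame)                  # s1: parameters for the first actor's head
        p_mod = [a + d for a, d in zip(p_s, p_dif)]   # s3: add the difference parameters
        head = render_head(p_mod)                     # s5: regenerate the head via equation (1)
        target_frames.append(composite(frame, head))  # s7: composite into the source frame
    return target_frames


# Toy usage with stand-in functions: each "frame" is already a parameter list.
p_dif = difference_parameters([1.0, 2.0], [0.5, 0.5])
frames = generate_target_sequence(
    [[0.0, 0.0], [1.0, -1.0]],
    p_dif,
    estimate_params=lambda frame: list(frame),
    render_head=lambda params: tuple(params),
    composite=lambda frame, head: head,
)
```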

Figure 13 illustrates the results of this animation
technique (although showing black and white images and
not colour). In particular, Figure 13a shows three
frames of the source video sequence, Figure 13b shows the
target image (which in this embodiment is computer
generated) and Figure 13c shows the corresponding three
frames of the target video sequence obtained in the
manner described above. As can be seen, an animated
sequence of the computer generated character has been
generated from a video clip of a real person and a single
image of the computer generated character.

HIERARCHICAL APPEARANCE MODEL 

In the systems described by Cootes et al and Blanz et al, 
the parametric model is created by placing a number of 
landmark points on a training image and then identifying 
the same landmark points on the other training images in 
order to identify how the location of and the pixel 
values around the landmark points vary within the 
training images. A principal component analysis is then 
performed on the matrix which consists of vectors of the 
landmark points. This PCA yields a set of Eigenvectors 
which describe the directions of greatest variation along 
which the landmark points change. Their appearance model 
includes the linear combination of the Eigenvectors plus 
parameters for translation, rotation and scaling. This 
single appearance model relates a compact set of 
appearance parameters to pixel values. 
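
The principal component analysis of landmark point vectors
referred to above can be sketched as follows. This is a
simplified illustration only: it extracts just the single
direction of greatest variation, and it uses power iteration
on the covariance matrix purely to keep the sketch free of
external libraries. That choice, and the names mean_vector,
covariance and leading_eigenvector, are assumptions of this
sketch and are not taken from the cited papers.

```python
import math


def mean_vector(vectors):
    """Mean of a list of equal-length landmark vectors (x1, y1, x2, y2, ...)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def covariance(vectors, mean):
    """Covariance matrix of the landmark vectors about their mean."""
    dim, n = len(mean), len(vectors)
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        d = [a - m for a, m in zip(v, mean)]
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += d[r] * d[c] / n
    return cov


def leading_eigenvector(matrix, iterations=200):
    """Direction of greatest variation, found by power iteration."""
    dim = len(matrix)
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        nxt = [sum(matrix[r][c] * vec[c] for c in range(dim)) for r in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:          # no variation at all in the data
            break
        vec = [x / norm for x in nxt]
    return vec


# Toy data: two landmark points per image; only the x coordinate of the
# second point varies between the three training images.
shapes = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
mean_shape = mean_vector(shapes)
main_mode = leading_eigenvector(covariance(shapes, mean_shape))
```

For the toy data, the main mode points along the third
vector component, that is, the only coordinate that changes
between the training shapes.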

In this embodiment, rather than having a single
appearance model for the object, a hierarchical
appearance model comprising several appearance models
which model variations in components of the object is
used. For example, in the case of human faces, the
hierarchical appearance model may include an appearance
model for the mouth, one for the left eye, one for the
right eye and one for the nose. Since it may be possible
to model various components of the object, the particular
hierarchical structure which will be used for a
particular object and application must first of all be
defined by the system designer.

Figure 5 schematically illustrates the structure of the
hierarchical appearance model used in this embodiment.
As shown, at the top of the hierarchy there is a general
face appearance model 61. Beneath the face appearance
model there is a mouth appearance model 63, a left eye
appearance model 65, a right eye appearance model 67, a
left eyebrow appearance model 69, a rest of left eye
appearance model 71, a right eyebrow appearance model 73,
a rest of right eye appearance model 75 and, in this
embodiment, a facet appearance model for each facet
defined in the training images. Figure 6 shows the head
of a training image in which the set of landmark points
has been placed at the appropriate points on the head.
As shown, in this embodiment, there are one hundred and
forty-eight triangular areas or facets defined by the
positions of the landmark points. Therefore, in this
embodiment, there are one hundred and forty-eight facet
appearance models 77.

The face appearance model 61 operates to relate a small
number of "global" appearance parameters to a further set
of appearance parameters, some of which are input to
facet appearance models 77, some of which are input to
the mouth appearance model 63, some of which are input to
the left eye appearance model 65 and the rest of which
are input to the right eye appearance model 67. Each of the
facet appearance models 77 operates to relate the input
parameters received from the appearance model which is
above it in the hierarchy to corresponding pixel values
for that facet. The mouth appearance model 63 is
operable to relate the parameters it receives from the
face appearance model 61 to a further set of appearance
parameters, respective ones of which are output to the
respective facet appearance models 77 for the facets
which are associated with the mouth. Similarly, the left
and right eye appearance models 65 and 67 operate to
relate the parameters they receive from the face
appearance model 61 to a further set of appearance
parameters, some of which are input to the appropriate
eyebrow appearance model and the rest of which are input
to the appropriate 'rest of eye' appearance model. These
appearance models in turn convert these parameters into
parameters for input to the facet appearance models
associated with the facets which appear in the left and
right eyes respectively. In this way, a small compact
set of "global" appearance parameters input to the face
appearance model 61 can filter through the hierarchical
structure illustrated in Figure 5 to generate a set of
pixel values for all the facets in a head which can then
be used to regenerate the image of the head.
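
The way in which parameters filter down through such a
hierarchy can be illustrated with a much reduced sketch. It
assumes, for illustration only, that every appearance model
is a linear map (a matrix plus a mean) and that the hierarchy
has just a face model, a mouth model and three facet models.
All matrices, sizes and names (linear_model, split, render)
are hypothetical.

```python
def linear_model(matrix, mean):
    """Return a function mapping input parameters to mean + matrix * params."""
    def apply(params):
        return [m + sum(w * p for w, p in zip(row, params))
                for row, m in zip(matrix, mean)]
    return apply


def split(values, sizes):
    """Divide an output vector among the models in the layer below."""
    parts, start = [], 0
    for size in sizes:
        parts.append(values[start:start + size])
        start += size
    return parts


# Top layer: one global parameter -> two mouth parameters and one cheek-facet parameter.
face = linear_model([[1.0], [2.0], [0.5]], [0.0, 0.0, 0.0])
# Middle layer: mouth parameters -> one parameter for each of two mouth facets.
mouth = linear_model([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])
# Bottom layer: each facet model turns its parameter into two "pixel" values.
mouth_facet_a = linear_model([[1.0], [0.0]], [10.0, 10.0])
mouth_facet_b = linear_model([[0.0], [1.0]], [20.0, 20.0])
cheek_facet = linear_model([[2.0], [2.0]], [30.0, 30.0])


def render(global_params):
    """Propagate the global parameters down to pixel values for every facet."""
    mouth_in, cheek_in = split(face(global_params), [2, 1])
    a_in, b_in = split(mouth(mouth_in), [1, 1])
    return {
        "mouth_a": mouth_facet_a(a_in),
        "mouth_b": mouth_facet_b(b_in),
        "cheek": cheek_facet(cheek_in),
    }


pixels = render([1.0])
```

A design point worth noting is that the cheek facet is driven
directly by the face model, whereas the mouth facets are
driven through the intermediate mouth model, mirroring the
arrangement described above.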

The way in which the individual appearance models of this
hierarchical appearance model are generated in this
embodiment will now be described with reference to
Figures 6 to 10.

In this embodiment, each of the training images stored in
the image database 32 is labelled with eighty-six
landmark points. In this embodiment, this is performed
manually by the user via the user interface 33. In
particular, each training image is displayed on the
display 11 and the user places the landmark points over
the head in the training image. These points delineate
the main features in the head, such as the position of
the hairline, neck, eyes, nose, ears and mouth. In order
to compare training faces, each landmark point is
associated with the same point on each face. In this
embodiment, the following landmark points are used:



Landmark  Associated Position               Landmark  Associated Position
Point                                       Point

LP1       Left corner of left eye           LP44      Eye, bottom
LP2       Right corner of right eye         LP45      Eye, top
LP3       Chin, bottom                      LP46      Eye, bottom
LP4       Right corner of left eye          LP47      Eyebrow, lower
LP5       Left corner of right eye          LP48      Eyebrow, upper
LP6       Mouth, left                       LP49      Cheek, left
LP7       Mouth, right                      LP50      Cheek, right
LP8       Nose, bottom                      LP51      Eyebrow, lower
LP9       Nose, between eyes                LP52      Eyebrow, upper
LP10      Upper lip, top                    LP53
LP11      Lower lip, bottom                 LP54
LP12      Neck, left, top                   LP55
LP13      Neck, right, top                  LP56
LP14      Face edge left, level with nose   LP57      Eyebrow, lower
LP15      Face edge                         LP58      Eyebrow, upper
LP16      Face edge right, level with nose  LP59      Eyebrow, lower
LP17      Face edge                         LP60      Eyebrow, upper
LP18      Top of head                       LP61      Eyebrow, lower
LP19      Hair edge                         LP62      Lower lip, top
LP20      Hair edge                         LP63      Centre forehead
LP21      Hair edge                         LP64      Upper lip, top left
LP22      Hair edge                         LP65      Upper lip, top right
LP23      Hair edge                         LP66      Lower lip, bottom right
LP24      Hair edge                         LP67      Lower lip, bottom left
LP25      Hair edge                         LP68      Eye, top left
LP26      Hair edge                         LP69      Eye, top right
LP27      Hair edge                         LP70      Eye, bottom right
LP28      Hair edge                         LP71      Eye, bottom left
LP29      Bottom, far left                  LP72      Eye, top left
LP30      Bottom, far right                 LP73      Eye, top right
LP31      Shoulder                          LP74      Eye, bottom right
LP32      Shoulder                          LP75      Eye, bottom left
LP33      Bottom, left                      LP76      Lower lip, top left
LP34      Bottom, middle                    LP77      Lower lip, top right
LP35      Bottom, right                     LP78      Chin, left
LP36      Left forehead                     LP79      Chin, right
LP37      Right forehead                    LP80      Neck, left
LP38      Centre, between eyebrows          LP81      Neckline, left
LP39      Nose, left                        LP82      Neckline
LP40      Nose, right                       LP83      Neckline, right
LP41      Nose edge, left                   LP84      Neck, right
LP42      Nose edge, right                  LP85      Hair edge
LP43      Eye, top                          LP86      Hair edge

The result of the manual placement of the landmark points 
is a table of landmark points for each training image, 
which identifies the (x, y) coordinate of each landmark 
point within the image. As shown in Figure 6, these 
landmark points are also used to define the location of 
predetermined triangular facets or areas within the
training image.



FACET APPEARANCE MODEL 

Figure 7 shows a flow chart illustrating the
processing steps involved in this embodiment in
determining a facet appearance model for facet (i). As
shown, in step s61, the system determines, for each
training image, the apex coordinates of facet (i) and
texture values from within facet (i). In order to sample
texture from within the facet at corresponding points
within each training facet, a transformation which
transforms the facet onto a reference facet is
determined. Figure 8 illustrates this transformation.
In particular, Figure 8 shows facet $f_i^v$ taken from the
v-th training image, which is defined by the landmark
points $(x_1^v, y_1^v)$, $(x_2^v, y_2^v)$ and $(x_3^v, y_3^v)$. The
transformation ($T_i^v$) which transforms those coordinates
onto coordinates (0,0), (1,0) and (0,1) is determined.
In this embodiment, the texture information extracted
from each training facet is defined by the regular array
of pixels shown in the reference facet. In order to
determine the corresponding red, green and blue pixel
values in the training image, the inverse transformation
($[T_i^v]^{-1}$) is used to transform the pixel locations in the
reference facet into corresponding locations in the
training facet, from which the RGB pixel values are
determined. In this embodiment, this transformation may
not result in an exact correspondence with a single image
pixel location since the pixel resolution in the actual
facet may be different to the resolution in the reference
facet. In this embodiment, the texture information (RGB
pixel values) which is determined is obtained by
interpolating between the surrounding image RGB pixel
values. In this embodiment, there are fifty pixels in
the regular array of pixels in the reference facet.
Therefore, fifty RGB pixel values are extracted for each
training facet. The texture information for facet (i)
from the v-th training image can then be represented by
a vector ($\mathbf{t}^{iv}$) of the form:

    $\mathbf{t}^{iv} = [t_1^{iv}, t_2^{iv}, t_3^{iv}, \ldots, t_{50}^{iv}]^T$

where $t_1^{iv}$ is the RGB texture information for the first
reference pixel extracted from facet (i) in the v-th
training image etc.
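
The mapping between a training facet and the reference facet,
and the interpolated sampling of RGB values, can be sketched
as follows. The sketch assumes an affine transformation
between the two triangles and bilinear interpolation between
the four surrounding pixels; the specification says only that
the values are interpolated, so the bilinear scheme, the
image layout (image[row][column] holding an (R, G, B) tuple)
and all function names are assumptions.

```python
import math


def reference_to_image(tri, u, v):
    """Inverse transformation: reference point (u, v) -> image location (x, y)."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    return (x1 + u * (x2 - x1) + v * (x3 - x1),
            y1 + u * (y2 - y1) + v * (y3 - y1))


def image_to_reference(tri, x, y):
    """Forward transformation: apexes map onto (0,0), (1,0) and (0,1)."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    u = ((x - x1) * (y3 - y1) - (x3 - x1) * (y - y1)) / det
    v = ((x2 - x1) * (y - y1) - (x - x1) * (y2 - y1)) / det
    return (u, v)


def bilinear(image, x, y):
    """Interpolate an RGB value at a non-integer location inside the image."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    xn = min(x0 + 1, len(image[0]) - 1)
    yn = min(y0 + 1, len(image) - 1)
    fx, fy = x - x0, y - y0

    def channel(ch):
        top = image[y0][x0][ch] * (1 - fx) + image[y0][xn][ch] * fx
        bottom = image[yn][x0][ch] * (1 - fx) + image[yn][xn][ch] * fx
        return top * (1 - fy) + bottom * fy

    return (channel(0), channel(1), channel(2))


def sample_facet(image, tri, reference_points):
    """Build the texture vector for one facet from its reference pixel positions."""
    return [bilinear(image, *reference_to_image(tri, u, v))
            for u, v in reference_points]


# Toy 3 x 3 image whose red value grows with the column and green with the row.
image = [[(c * 10, r * 10, 0) for c in range(3)] for r in range(3)]
facet = [(0, 0), (2, 0), (0, 2)]
texture = sample_facet(image, facet, [(0.25, 0.25)])
```

In the embodiment the reference facet holds fifty pixel
positions; the single reference point used here merely keeps
the example small.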

In this embodiment, the facet appearance models 77 treat
shape and texture separately. Therefore, in step s63,
the system performs a principal component analysis (PCA)
on the set of texture training vectors generated in step
s61. For a more detailed discussion of principal
component analysis, the reader is referred to the book by
W. J. Krzanowski entitled "Principles of Multivariate
Analysis - A User's Perspective" 1998, Oxford Statistical
Science Series. As those skilled in the art will
appreciate, this principal component analysis determines
all possible modes of variation within the training
texture vectors. However, since each of the facets is
associated with a similar point on the face, most of the
variation within the data can be explained by a few modes
of variation. The result of the principal component
analysis is a facet texture appearance model (defined by
matrix $F_i$) which relates a vector of facet texture
parameters to a vector of texture pixel values, by:

    $\mathbf{p}_{Ft}^{iv} = F_i (\mathbf{t}^{iv} - \bar{\mathbf{t}}^{i})$    (3)

where $\mathbf{t}^{iv}$ is the RGB texture vector defined above, $\bar{\mathbf{t}}^{i}$ is
the mean RGB texture vector for facet (i), $F_i$ is a matrix
which defines the facet texture appearance model for
facet (i) and $\mathbf{p}_{Ft}^{iv}$ is a vector of the facet texture
parameters which describes the RGB texture vector $\mathbf{t}^{iv}$.
The matrix $F_i$ describes the main modes of variation of
the texture within the training facets; and the vector of
facet texture parameters ($\mathbf{p}_{Ft}^{iv}$) for a given input facet
has a parameter associated with each mode of variation
whose value relates the texture of the input facet to the
corresponding mode of variation.



As those skilled in the art will appreciate, for facets
which describe fairly constant parts of the face, such as
the chin or cheeks, very few parameters will be needed to
model the variability within the training images.
However, facets which are associated with areas of the
face where there is a large amount of variability (such
as facets which form part of the eye), will require a
larger number of facet texture parameters to describe the
variability within the training images. Therefore, in
step s65, the system determines how many texture
parameters are needed for the current facet and stores
the appropriate facet appearance model matrix.
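
One possible way of deciding how many texture parameters a
facet needs is sketched below. The criterion used, namely
keeping the smallest number of modes whose eigenvalues
account for a chosen fraction of the total variance, is an
assumption made for this illustration; the description above
does not state which criterion is applied in step s65. The
name modes_needed and the eigenvalue figures are
hypothetical.

```python
def modes_needed(eigenvalues, fraction=0.9):
    """Smallest number of modes whose variance reaches `fraction` of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(eigenvalues)


# A fairly constant facet (for example on the cheek): one dominant mode.
cheek_modes = modes_needed([9.8, 0.1, 0.1])

# A highly variable facet (for example part of the eye): variance spread out.
eye_modes = modes_needed([4.0, 3.0, 2.5, 0.5])
```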



In addition to being able to determine a set of texture
parameters $\mathbf{p}_{Ft}^{iv}$ for a given texture vector $\mathbf{t}^{iv}$, equation
(3) can be solved with respect to the texture vector $\mathbf{t}^{iv}$
to give:

    $\mathbf{t}^{iv} = \bar{\mathbf{t}}^{i} + F_i^T \mathbf{p}_{Ft}^{iv}$    (4)

since $F_i F_i^T$ equals the identity matrix. Therefore, by
modifying the set of texture parameters ($\mathbf{p}_{Ft}^{iv}$) within
suitable limits, new textures for facet (i) can be
generated which are similar to those in the training set.
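
Equations (3) and (4) can be illustrated with a small
numerical sketch. It assumes that the rows of the model
matrix are orthonormal modes of variation, which is the
condition under which the product of the matrix and its
transpose is the identity. The three-element texture vector,
the two modes and the function names project and reconstruct
are toy values chosen for this sketch only.

```python
def project(texture, mean, modes):
    """Equation (3): parameters = F (t - mean), one parameter per mode."""
    centred = [t - m for t, m in zip(texture, mean)]
    return [sum(f * c for f, c in zip(row, centred)) for row in modes]


def reconstruct(params, mean, modes):
    """Equation (4): t = mean + F^T parameters."""
    return [m + sum(modes[k][j] * params[k] for k in range(len(modes)))
            for j, m in enumerate(mean)]


mean_texture = [10.0, 20.0, 30.0]
modes = [[1.0, 0.0, 0.0],      # first mode of variation
         [0.0, 1.0, 0.0]]      # second mode of variation

params = project([12.0, 19.0, 30.0], mean_texture, modes)
texture = reconstruct(params, mean_texture, modes)

# Changing a parameter generates a new texture along the same modes.
new_texture = reconstruct([params[0] + 1.0, params[1]], mean_texture, modes)
```

Any part of a texture that lies outside the retained modes is
not represented by the parameters, so reconstruction returns
the closest texture that the retained modes can express.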

Once the above procedure has been performed for each of
the one hundred and forty-eight facets in the training
images, a facet texture appearance model will have been
generated for each of those facets. In this embodiment,
the facet appearance model does not compress the
parameters defining the shape of the facets, since only
six parameters are needed to define the shape of each
facet - two parameters for each (x,y) coordinate of the
facet's apexes.

MOUTH APPEARANCE MODEL 

Figure 9 shows a flow chart illustrating the main
processing steps required in order to generate the mouth
appearance model 63. As shown, in step s67, the system
uses the facet appearance models for the facets which
form part of the mouth to generate shape and texture
parameters from those facets for each training image.
Therefore, referring to Figure 10, the mouth appearance
model 63 will receive texture and shape parameters from
the facet appearance model for facet (i), facet (j) and
facet (n) for the corresponding facets in each of the
training images 79. As illustrated in Figure 10, the
appearance model for facet (i) is operable to generate,
for each training image, six shape parameters
(corresponding to the three (x,y) coordinates of the
apexes of facet (i)) and six texture parameters.
Similarly, the appearance model for facet (j) is operable
to generate, for each training image, six shape
parameters and four texture parameters and the appearance
model for facet (n) is operable to generate, for each
training image, six shape parameters and three texture
parameters.



The processing then proceeds to step s69 where the system
performs a principal component analysis on the shape and
texture parameters generated for the training images by
the facet appearance models associated with the mouth.
In this embodiment, the mouth appearance model 63 treats
the shape and texture separately. In particular, for
each training image, the system concatenates the six
shape parameters of each facet associated with the mouth
to form the following shape vector:

$$\mathbf{p}^{FMs} = [\,x_1^{f_i}, y_1^{f_i}, x_2^{f_i}, y_2^{f_i}, x_3^{f_i}, y_3^{f_i}, x_1^{f_j}, y_1^{f_j}, x_2^{f_j}, y_2^{f_j}, \ldots\,]^{T}$$

and concatenates the facet texture parameters output by
the facet appearance models associated with the mouth to
form the following texture vector:

$$\mathbf{p}^{FMt} = [\,(\mathbf{p}^{F_i t})^{T}, (\mathbf{p}^{F_j t})^{T}, \ldots, (\mathbf{p}^{F_n t})^{T}\,]^{T}$$

The system then performs a principal component analysis
on the shape vectors generated by all the training images
to generate a shape appearance model for the mouth
(defined by matrix $\mathbf{M}_s$) which relates each mouth shape
vector to a corresponding vector of mouth shape
parameters by:

$$\mathbf{p}^{Ms}_v = \mathbf{M}_s\,(\mathbf{p}^{FMs}_v - \bar{\mathbf{p}}^{FMs}) \qquad (5)$$

where $\mathbf{p}^{FMs}_v$ is the mouth shape vector for the mouth in the
v-th training image, $\bar{\mathbf{p}}^{FMs}$ is the mean mouth shape vector
from the training vectors and $\mathbf{p}^{Ms}_v$ is a vector of mouth
shape parameters for the mouth shape vector $\mathbf{p}^{FMs}_v$. The
mouth shape model, defined by matrix $\mathbf{M}_s$, describes the
main modes of variation of the shape of the mouths within
the training images; and the vector of mouth shape
parameters ($\mathbf{p}^{Ms}_v$) for the mouth in the v-th training image
has a parameter associated with each mode of variation
whose value relates the shape of the input mouth to the
corresponding mode of variation.

As with the facet appearance models, equation (5) above
can be rewritten with respect to the mouth shape vector
$\mathbf{p}^{FMs}_v$ to give:

$$\mathbf{p}^{FMs}_v = \bar{\mathbf{p}}^{FMs} + \mathbf{M}_s^{T}\,\mathbf{p}^{Ms}_v \qquad (6)$$

since $\mathbf{M}_s\mathbf{M}_s^{T}$ equals the identity matrix. Therefore, by
modifying the mouth shape parameters, new mouth shapes
can be generated which will be similar to those in the
training set.

The system then performs a principal component analysis
on the mouth texture parameter vectors ($\mathbf{p}^{FMt}$) which are
generated for the training images. This principal
component analysis generates a mouth texture model
(defined by matrix $\mathbf{M}_t$) which relates each of the facet
texture parameter vectors for the facets associated with
the mouth, to a corresponding vector of mouth texture
parameters, by:

$$\mathbf{p}^{Mt}_v = \mathbf{M}_t\,(\mathbf{p}^{FMt}_v - \bar{\mathbf{p}}^{FMt}) \qquad (7)$$

where $\mathbf{p}^{FMt}_v$ is a vector of mouth facet texture parameters
generated by the facet appearance models associated with
the mouth for the mouth in the v-th training image; $\bar{\mathbf{p}}^{FMt}$
is the mean vector of mouth facet texture parameters from
the training vectors and $\mathbf{p}^{Mt}_v$ is a vector of mouth texture
parameters for the facet texture parameters $\mathbf{p}^{FMt}_v$. The
matrix $\mathbf{M}_t$ describes the main modes of variation within
the training images of the facet texture parameters
generated by the facet appearance models which are
associated with the mouth; and the vector of mouth
texture parameters ($\mathbf{p}^{Mt}_v$) has a parameter associated with
each of those modes of variation whose value relates the
texture of the input mouth to the corresponding mode of
variation.
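
The second-level analysis can be sketched as a principal component analysis over the concatenated facet texture parameters. The sizes below follow the Figure 10 example (six, four and three texture parameters for facets i, j and n, and four mouth texture parameters). The random training data and the helper names `fit_pca` and `mouth_texture_vectors` are stand-ins assumed for illustration.

```python
import numpy as np

def fit_pca(vectors, n_modes):
    """Return (mean, M); the rows of M are the n_modes leading modes."""
    vectors = np.asarray(vectors, dtype=float)
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:n_modes]

def mouth_texture_vectors(facet_params):
    """Concatenate per-facet texture parameters, one row per training image.

    facet_params: list of arrays, each of shape (n_images, n_facet_params).
    """
    return np.hstack([np.asarray(p, dtype=float) for p in facet_params])

# Stand-in training data: 30 images, facets i, j, n with 6, 4, 3 parameters.
rng = np.random.default_rng(0)
n_images = 30
facet_params = [rng.normal(size=(n_images, k)) for k in (6, 4, 3)]

p_FMt = mouth_texture_vectors(facet_params)     # shape (30, 13)
mean_FMt, M_t = fit_pca(p_FMt, n_modes=4)       # M_t has shape (4, 13)
p_Mt = (p_FMt - mean_FMt) @ M_t.T               # equation (7) for every image
```

The same helper applies unchanged to the mouth shape vectors, since the shape and texture are treated separately in this embodiment.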

The processing then proceeds to step s71 shown in Figure 
9 where the system determines the number of shape 
parameters and texture parameters needed to describe the 
training data received from the facet appearance models 
which are associated with the mouth. As shown in Figure
10, in this embodiment, the mouth appearance model 63 
requires five shape parameters and four texture 
parameters to be able to model most of this variation. 
The system therefore stores the appropriate mouth shape 
and texture appearance model matrices for subsequent use. 

As those skilled in the art will appreciate, a similar 
procedure is performed to determine each of the 
appearance models shown in Figure 5, starting from the 
facet appearance models at the base of the hierarchy. A 
further description of how these remaining appearance 
models are determined will, therefore, not be given here. 
The resulting hierarchical appearance model allows a 
small number of global face appearance parameters to be 
input to the face appearance model 61, which generates
further parameters which propagate down through the
hierarchical model structure until facet pixel values are
generated, from which an image which corresponds to the
global appearance parameters can be generated.
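
The downward propagation through the hierarchy can be sketched with a small recursive structure. This is an illustrative toy with two facets under one mouth model, assuming every node is linear. The class name `LinearModel` and the slicing scheme used to hand parameters to child models are assumptions made for the example.

```python
import numpy as np

class LinearModel:
    """One node of the hierarchy: outputs = mean + M^T @ params."""

    def __init__(self, mean, M, children=()):
        self.mean = np.asarray(mean, dtype=float)
        self.M = np.asarray(M, dtype=float)
        # (slice, LinearModel) pairs: which outputs feed which lower model.
        self.children = list(children)

    def generate(self, params):
        out = self.mean + self.M.T @ np.asarray(params, dtype=float)
        if not self.children:
            return out                       # leaf: facet pixel values
        # Pass each slice of the outputs down to the corresponding child.
        return np.concatenate(
            [child.generate(out[sl]) for sl, child in self.children]
        )

# Toy hierarchy: a mouth model with 3 parameters drives two facet models,
# each taking 2 parameters and producing 5 pixel values.
rng = np.random.default_rng(0)
facet_a = LinearModel(rng.normal(size=5), rng.normal(size=(2, 5)))
facet_b = LinearModel(rng.normal(size=5), rng.normal(size=(2, 5)))
mouth = LinearModel(
    np.zeros(4),
    rng.normal(size=(3, 4)),
    children=[(slice(0, 2), facet_a), (slice(2, 4), facet_b)],
)
pixels = mouth.generate(np.zeros(3))         # 10 pixel values in total
```

In this toy, all-zero mouth parameters reproduce the mean texture of each facet, which mirrors the role of the mean vectors in equations (4) and (6).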

AUTOMATIC GENERATION OF APPEARANCE PARAMETERS

In the description given above of the way in which the
appearance models are generated, appearance parameters
for an image were generated from a manual placement of a
number of landmark points over the image. However,
during use of the appearance model to track the first
actor's head in the source video sequence and during the
calculation of the difference parameters, the
appearance parameters for the heads in the input images
were automatically calculated. This task involves
finding the set of global appearance parameters $\mathbf{p}$ which
best describe the pixels in view. This problem is
complicated because the inverse of each of the appearance
models in the hierarchical appearance model is not
necessarily one-to-one. In this embodiment, the
appearance parameters for the head in an input image are
calculated in a two-step process. In the first step, an
initial set of global appearance parameters for the head
in the current frame (frame $i$) is found using a simple and
rapid technique. For all but the first frame of the
source video sequence, this is achieved by simply using
the appearance parameters from the preceding video frame
(frame $i-1$) before modification in step s3 (i.e. parameters
$\mathbf{p}^{i-1}$). In this embodiment, the global appearance
parameters ($\mathbf{p}$) effectively define the shape and colour
texture of the head. For the first frame and for the
target image, the initial estimate of the appearance
parameters is set to the mean set of appearance
parameters and the scale, position and orientation are
initially estimated by the user manually placing the mean
head over the head in the image.

In the second step, an iterative technique is used in
order to make fine adjustments to the initial estimate of
the appearance parameters. The adjustments are made in
an attempt to minimise the difference between the head
described by the global appearance parameters (the model
head) and the head in the current video frame (the image
head). With 50 appearance parameters, this represents a
difficult optimisation problem. This can be performed by
using a standard steepest descent optimisation technique
to iteratively reduce the mean squared error between the
given image pixels and those predicted by a particular
set of appearance parameter values. In particular, the
technique minimises the following error function $E(\mathbf{p})$:

$$E(\mathbf{p}) = [\mathbf{I}_a - F(\mathbf{p})]^{T}[\mathbf{I}_a - F(\mathbf{p})] \qquad (8)$$

where $\mathbf{I}_a$ is a vector of actual image RGB pixel values at
the locations where the appearance model predicts values
(the appearance model does not predict all pixel values
since it ignores background pixels and only predicts a
subsample of pixel values within the object being
modelled) and $F(\mathbf{p})$ is the vector of image RGB pixel
values predicted by the hierarchical appearance model.

As those skilled in the art will appreciate, $E(\mathbf{p})$ will
only be zero when the model head (i.e. $F(\mathbf{p})$) predicts the
actual image head exactly. Standard steepest
descent optimisation techniques stipulate that a step in
the direction $-\nabla E(\mathbf{p})$ should result in a reduction in the
error function $E(\mathbf{p})$, provided the error function is well
behaved. Therefore, the change ($\Delta\mathbf{p}$) in the set of
parameter values should be:

$$\Delta\mathbf{p} = 2[\nabla F(\mathbf{p})]^{T}[\mathbf{I}_a - F(\mathbf{p})] \qquad (9)$$

which requires the calculation of the differential of the
appearance model, i.e. $\nabla F(\mathbf{p})$.
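
The form of this parameter change follows directly from differentiating a sum-of-squares error. A brief derivation is sketched below, writing $\mathbf{r}$ for the residual and $\mathbf{J}$ for the Jacobian of the model (both symbols are introduced here only for convenience) and assuming a unit step length:

```latex
\begin{aligned}
E(\mathbf{p}) &= \mathbf{r}^{T}\mathbf{r},
  \qquad \mathbf{r} = \mathbf{I}_a - F(\mathbf{p}),
  \qquad \mathbf{J} = \nabla F(\mathbf{p}) \\
\nabla E(\mathbf{p}) &= 2\left(\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right)^{T}\mathbf{r}
  = -2\,\mathbf{J}^{T}\mathbf{r} \\
\Delta\mathbf{p} &= -\nabla E(\mathbf{p})
  = 2\,\mathbf{J}^{T}\,[\mathbf{I}_a - F(\mathbf{p})]
\end{aligned}
```

The cost of this approach lies in evaluating $\mathbf{J}$ at every iteration, which is what the constant-matrix approximation described next avoids.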

The technique described by Edwards et al assumes that, on
average over the whole parameter space, $\nabla F(\mathbf{p})$ is
constant. The update equation then becomes:

$$\Delta\mathbf{p} = \mathbf{A}[\mathbf{I}_a - F(\mathbf{p})] \qquad (10)$$

for some constant matrix $\mathbf{A}$ (referred to as the "Active
matrix") which is determined beforehand during a training
routine. In this embodiment, rather than using a single
constant matrix associated with the entire hierarchical
appearance model, an Active matrix is determined and used
for each of the individual appearance models which form
part of the hierarchical appearance model. The way in
which these Active matrices are determined in this
embodiment will now be described with reference to
Figures 11a and 11b, which illustrate the processing
steps performed to generate the Active matrix for each
facet appearance model and the Active matrix for the
mouth appearance model.

As shown in Figure 11a, in step s73, the system chooses
a random facet parameter vector ($\mathbf{p}^{F_i}$) for the current
facet (i) and then, in step s75, perturbs this facet parameter
vector by a small random amount to create $\mathbf{p}^{F_i} + \Delta\mathbf{p}^{F_i}$. In
this embodiment, the facet parameter vectors include not
only the texture parameters, but also the six shape
parameters which define the (x,y) coordinates of the
facet's location within the image. The processing then
proceeds to step s77 where the system uses the parameter
vector $\mathbf{p}^{F_i}$ and the perturbed parameter vector $\mathbf{p}^{F_i} + \Delta\mathbf{p}^{F_i}$ to
create model images $\mathbf{I}_0^{F_i}$ and $\mathbf{I}_1^{F_i}$ respectively. The
processing then proceeds to step s79 where the system
records the parameter change $\Delta\mathbf{p}^{F_i}$ and image difference
$\mathbf{I}_1^{F_i} - \mathbf{I}_0^{F_i}$. Then in step s81, the system determines whether
or not there is sufficient training data for the current
facet. If there is not, then the processing returns to
step s73. Once sufficient training data has been
generated, the processing proceeds to step s83 where the
system performs multiple multivariate linear regressions
on the data for the current facet to identify an Active
matrix ($\mathbf{A}^{F_i}$) for the current facet.
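
The perturb-and-regress procedure can be sketched as below. This is a minimal illustration in which `render` is a hypothetical stand-in for the facet appearance model, and the sample count, perturbation scale and the use of an ordinary least-squares fit are assumptions made for the example.

```python
import numpy as np

def learn_active_matrix(render, n_params, n_samples=200, scale=0.1, seed=0):
    """Estimate A such that delta_p ~= A @ (I_1 - I_0).

    render: function mapping a parameter vector to a vector of pixel
            values (a stand-in for the facet appearance model).
    """
    rng = np.random.default_rng(seed)
    dps, dIs = [], []
    for _ in range(n_samples):
        p0 = rng.normal(size=n_params)             # random parameter vector
        dp = scale * rng.normal(size=n_params)     # small random perturbation
        dI = render(p0 + dp) - render(p0)          # difference of model images
        dps.append(dp)                             # record the training pair
        dIs.append(dI)
    dP = np.array(dps)                             # (n_samples, n_params)
    dI = np.array(dIs)                             # (n_samples, n_pixels)
    # Least-squares fit of dP = dI @ A^T, one regression per parameter.
    A_T, *_ = np.linalg.lstsq(dI, dP, rcond=None)
    return A_T.T
```

For a model that is exactly linear in its parameters, the matrix recovered this way acts as a left inverse of the model's Jacobian, so a single application of equation (10) would recover the perturbation.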

Figure 11b shows the processing steps required to
calculate the Active matrix for the mouth appearance
model. As shown, in step s85, the system chooses a
random mouth parameter vector $\mathbf{p}^{M}$. In this embodiment,
this vector includes both the mouth shape parameters and
the mouth texture parameters. Then, in step s87, the
system perturbs this mouth parameter vector by a small
random amount to create $\mathbf{p}^{M} + \Delta\mathbf{p}^{M}$. The processing then
proceeds to step s89 where the system uses the mouth
parameter vector $\mathbf{p}^{M}$ and the perturbed mouth parameter
vector $\mathbf{p}^{M} + \Delta\mathbf{p}^{M}$ to create model images $\mathbf{I}_0^{M}$ and $\mathbf{I}_1^{M}$
respectively, using the mouth appearance model and the
facet appearance models associated with the mouth. The
processing then proceeds to step s91 where the facet
appearance models associated with the mouth are used
again to transform the mouth model images $\mathbf{I}_0^{M}$ and $\mathbf{I}_1^{M}$ into
corresponding facet appearance parameters $\mathbf{p}_0^{FM}$ and $\mathbf{p}_1^{FM}$,
which are then subtracted to determine the corresponding
change $\Delta\mathbf{p}^{FM}$ in the mouth facet parameters. The
processing then proceeds to step s93 where the system
records the mouth parameter change $\Delta\mathbf{p}^{M}$ and the mouth
facet parameter change $\Delta\mathbf{p}^{FM}$. The processing then
proceeds to step s95 where the system determines whether
or not there is sufficient training data. If there is
not, then the processing returns to step s85. Once
sufficient training data has been generated, the
processing proceeds to step s97, where the system
performs multiple multivariate linear regressions on the
training data for the mouth to identify the Active matrix
($\mathbf{A}^{M}$) for the mouth which relates changes in mouth
parameters $\Delta\mathbf{p}^{M}$ to changes in facet parameters $\Delta\mathbf{p}^{FM}$ for the
facets associated with the mouth.

As those skilled in the art will appreciate, a similar
processing technique is used in order to identify the
Active matrix for each of the appearance models shown in
Figure 5.

Once the Active matrices have been determined for the
hierarchical appearance model, they can then be used to
iteratively update a current estimate of a set of
appearance parameters for an input image. Figure 12
illustrates the processing steps performed in this
iterative routine for the current source video frame. As
shown, in step s101, the system initially estimates a set
of global parameters for the head in the current source
video frame. The processing then proceeds to step s103
where the system generates a model image from the
estimated global parameters and the hierarchical
appearance model. The system then proceeds to step s105
where it determines the image error between the model
image and the current source video frame. Then, in step
s107, the system uses this image error to propagate
parameter changes up the hierarchy of the hierarchical
appearance model using the stored Active matrices to
determine a change in the global parameters. This change
in global parameters is then used, in step s109, to
update the current global parameters for the current
source video frame. The system then determines, in step
s111, whether or not convergence has been reached by
comparing the error obtained from equation (8) using the
updated global parameters with a predetermined threshold
(Th). If convergence has not been reached, then the
processing returns to step s103. Once convergence is
reached, the processing proceeds to step s113, where the
current global appearance parameters are output as the
global appearance parameters for the current source video
frame and then the processing ends.
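
The iterative routine can be sketched as the loop below. This is a simplified illustration that uses a single Active matrix rather than propagating changes up a hierarchy. The `render` function, the default threshold and the `max_iter` safety bound are assumptions added for the example.

```python
import numpy as np

def fit_parameters(image, render, A, p_init, threshold=1e-8, max_iter=50):
    """Iterate p <- p + A @ (image - render(p)) until the error is small.

    image:  vector of observed pixel values.
    render: function mapping parameters to predicted pixel values.
    A:      Active matrix relating image error to parameter change.
    """
    p = np.asarray(p_init, dtype=float).copy()     # initial estimate
    error = float(np.sum((image - render(p)) ** 2))
    for _ in range(max_iter):
        if error < threshold:                      # convergence test
            break
        residual = image - render(p)               # image error
        p = p + A @ residual                       # update, as in equation (10)
        error = float(np.sum((image - render(p)) ** 2))  # sum of squared errors
    return p, error
```

In practice the initial estimate would come from the preceding frame, which keeps the starting error small enough for the constant-matrix assumption to be reasonable.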

ALTERNATIVE EMBODIMENTS 

In the above embodiment, the same hierarchical model 
structure was used to model the variation in the shape 
and texture within the training images. As those skilled 
in the art will appreciate, one model hierarchy can be 
used to model the shape variation and a different model 
hierarchy can be used to model the texture variation. 
Alternatively still, rather than separating the shape and 
texture parameters, each of the appearance models within 
the hierarchical model may model the combined variation
of the shape and texture within the training images.

In the above embodiments, a facet appearance model was
generated for each facet defined within the training
images. As those skilled in the art will appreciate,
many of the facets may be grouped together such that a
single facet appearance model is generated for those
facets. In one form of such an embodiment, a single
facet appearance model may be determined which models the
variability of texture within each facet of the training
images.

In the above embodiments, the same amount of texture
information was extracted from each facet within the 
training images. In particular, fifty RGB texture values 
were extracted from each training facet. In an 
alternative embodiment, the amount of texture information 
extracted from each facet may vary in dependence upon the 
size of the facet. For example, more texture information 
may be extracted from larger facets or more texture 
information may be extracted from facets associated with 
important features of the face, such as the mouth, eyes 
or nose. 

In the above embodiments, each appearance model was
determined from a principal component analysis of a set
of training data. This principal component analysis
determines a linear relationship between the training
data and a set of model parameters. As those skilled in
the art will appreciate, techniques other than principal
component analysis can be used to determine a parametric
model which relates a set of parameters to the training
data. This model may define a non-linear relationship
between the training data and the model parameters. For
example, one or more of the models within the hierarchy
may comprise a neural network which relates the set of
input parameters to the training data.

In the above embodiments, a principal component analysis 
was performed on a set of training data in order to 
identify a relatively small number of parameters which 
describe the main modes of variation within the training 
data. This allows a relatively small number of input 
parameters to be able to generate a larger set of output 
parameters from the model. However, as those skilled in 
the art will appreciate, this is not essential. One or 
more of the appearance models may act as transformation 
models in which the number of input parameters is the same
as or greater than the number of output parameters. This
can be used to generate a set of input parameters which
can be changed by the user in some intuitive way, for
example in order to identify parameters which have a
linear relationship with features in the object, such as
a parameter that linearly changes the amount of smile
within a face image.

In the above embodiments, a set of Active matrices was
used in order to identify automatically a set of
appearance parameters for an input image. As those
skilled in the art will appreciate, rather than having
separate Active matrices for each of the components in
the hierarchical appearance model, a global Active matrix
may be used instead. Further, although both the shape
and grey level parameters were used in order to derive
the Active matrices, suitable Active matrices can be
determined using just the shape information.



In the above embodiments, the variation in both the shape
and texture within the training images was modelled. As
those skilled in the art will appreciate, this
hierarchical modelling technique can be used to model
only the shape of the objects within the training images.
Such a shape model could then be used to track objects
within a video sequence.



In the first embodiment, the target image illustrated a
computer generated head. This is not essential. For
example, the target image might be a hand-drawn head or
an image of a real person. Figures 13d and 13e
illustrate how an embodiment with a hand-drawn character
might be used in character animation. In particular,
Figure 13d shows a hand-drawn sketch of a character
which, when combined with the images from the source
video sequence (some of which are shown in Figure 13a),
generates a target video sequence, some frames of which
are shown in Figure 13e. As can be seen from a
comparison of the corresponding frames in the source and
target video frames, the hand-drawn sketch has been
animated automatically using this technique. As those
skilled in the art will appreciate, this is a much
quicker and simpler technique for achieving computer
animation as compared with existing systems which require
the animator to manually create each frame of the
animation. In particular, in this embodiment, all that
is required is a video sequence of a real life actor
acting out the scene to be animated, together with a
single sketch of the character to be animated.



The above embodiment has described the way in which a
target image can be used to modify a source video
sequence. In order to do this, a set of appearance
parameters has to be automatically calculated for each
frame in the video sequence. This involved the use of a
number of Active matrices which relate image errors to
appearance parameter changes. As those skilled in the
art will appreciate, similar processing is required in
other applications, such as the tracking of an object
within a video sequence, the tracking of a human face
within a video sequence or the tracking of a knee joint
in an MRI scan.



In the above embodiment, the appearance model was used to
model the variations in facial expressions and 3D pose of
human heads. As those skilled in the art will
appreciate, the appearance model can be used to model the
appearance of any deformable object such as parts of the
body and other animals and objects. For example, the
above techniques can be used to track the movement of
lips in a video sequence. Such an embodiment could be
used in film dubbing applications in order to synchronise
the lip movements with the dubbed sound. This animation
technique might also be used to give animals and other
objects human-like characteristics by combining images of
them with a video sequence of an actor. This technique
can also be used for monitoring the shape and appearance
of objects passing along a production line for quality
control purposes.

In the above embodiment, the appearance model was
generated by using a principal component analysis of
shape and texture data which is extracted from the
training images. As those skilled in the art will
appreciate, by modelling the features of the training
heads in this way, it is possible to accurately model
each head by just a small number of parameters. However,
other modelling techniques, such as vector quantisation
and wavelet techniques, can be used.

In the above embodiments, the training images used to
generate the appearance model were all colour images in
which each pixel had an RGB value. As those skilled in
the art will appreciate, the way in which the colour is
represented in this embodiment is not important. In
particular, rather than each pixel having a red, green
and blue value, they might be represented by a
chrominance and a luminance component or by hue,
saturation and value components. Alternatively still,
the training images may be black and white images, in
which case only grey level data would be extracted from
the facets in the training images. Additionally, the
resolution of each training image may be different.

In the above embodiment, during the automatic generation
of the appearance parameters, and in particular during
the iterative updating of these appearance parameters, the
error between the input image and the model image was
generated using the appearance model. Since this
iterative technique still requires a relatively accurate
initial estimate for the appearance parameters, it is
possible initially to perform the iterations using lower
resolution images and, once convergence has been reached
for the lower resolutions, to then increase the resolution
of the images and to repeat the iterations for the higher
resolutions. In such an embodiment, separate Active
matrices would be required for each of the resolutions.



In the above embodiment, the difference parameters were
determined by comparing the image of the first actor from
the source video sequence with the image of the second
actor in the target image. In an alternative embodiment,
a separate image of the first actor may be provided which
does not form part of the source video sequence.

In the above embodiments, each of the appearance models
modelled variations in two-dimensional images. The above
modelling technique could be adapted to work with 3D
images and animations. In such an embodiment, the
training images used to generate the appearance model
would normally include 3D images instead of 2D images.
The three-dimensional models may be obtained using a
three-dimensional scanner, which typically works either by
using laser range-finding over the object or by using one
or more stereo pairs of cameras. Once a 3D hierarchical
appearance model has been created from the training
models, new 3D models can be generated by adjusting the
appearance parameters and existing 3D models can be
animated using the same differencing technique that was
used in the two-dimensional embodiment described above.
This 3D model can then be used to track 3D objects
directly within a 3D animation. Alternatively, a 2D
model may be used to track the 3D object within a video
sequence and the result then used to generate 3D data for
the tracked object.

In the above embodiment, a set of difference parameters
was identified which describes the main differences
between the head in the video sequence and the head in
the target image, which difference parameters were used
to modify the video sequence so as to generate a target
video sequence showing the second head. In the
embodiment, the set of difference parameters was added
to a set of appearance parameters for the current frame
being processed. In an alternative embodiment, the
difference parameters may be weighted so that, for
example, the target video sequence shows a head having
characteristics from both the first and second actors.
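
The weighting of the difference parameters can be sketched as a scaled addition. The function name `apply_difference` and the use of a single scalar weight are assumptions made for this illustration; an array of per-parameter weights would work the same way through broadcasting.

```python
import numpy as np

def apply_difference(frame_params, diff_params, weight=1.0):
    """Add weighted difference parameters to a frame's appearance parameters.

    weight=1.0 reproduces the unweighted addition of the main embodiment,
    weight=0.0 leaves the first actor's parameters unchanged, and values
    in between blend characteristics of the two heads.
    """
    frame_params = np.asarray(frame_params, dtype=float)
    diff_params = np.asarray(diff_params, dtype=float)
    return frame_params + weight * diff_params
```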

In the above embodiment, a hierarchical appearance model
is used to model the appearance of human faces. The
model is then used to modify a source video sequence
showing a first actor performing a scene to generate a
target video sequence showing a second actor performing
the same scene. As those skilled in the art will
appreciate, the hierarchical model presented above can be
used in various other applications. For example, the
hierarchical appearance model can be used for synthetic
two-dimensional or three-dimensional character
generation; video compression when the video is
substantially that of an object which is modelled by the
appearance model; object recognition for security
purposes; face tracking for human performance analysis or
human computer interaction and the like; 3D model
generation from two-dimensional images; and image editing
(for example making people look older or younger, fatter
or thinner etc).

In the above embodiment, an iterative process was used to
update an estimated set of appearance parameters for an
input image. This iterative process continued until an
error between the actual image and the image predicted by
the model was below a predetermined threshold. In an
alternative embodiment, where there is only a
predetermined amount of time available for determining a
set of appearance parameters for an input image, this
iterative routine may be performed for a predetermined
period of time or for a predetermined number of
iterations.



CLAIMS:

1. A parametric model for modelling the shape of an
object, the model comprising:
data defining a function which relates a set of
input parameters to a set of locations which identify the
relative positions of a plurality of predetermined points
on the object;

characterised in that said data defines a
hierarchical set of functions in which a function in a
top layer of the hierarchy is operable to generate a set
of output parameters from a set of input parameters and
in which one or more functions in a bottom layer of the
hierarchy are operable to receive parameters output from
one or more functions from a higher layer of the
hierarchy and to generate therefrom at least some of said
locations which identify the relative positions of said
predetermined points.

2. A model according to claim 1, wherein said hierarchy
comprises one or more intermediate layers of functions
which are operable to receive parameters output from one
or more functions from a higher layer of the hierarchy
and to generate therefrom a set of output parameters for
input to functions in a lower layer of the hierarchy.



3. A model according to claim 1 or 2, for modelling the
two-dimensional shape of the object by identifying the
relative positions of said predetermined points in a
predetermined plane.

4. A model according to claim 1 or 2, for modelling the
three-dimensional shape of the object by identifying the
relative positions of the predetermined points in a
three-dimensional space.

5. A model according to any preceding claim, wherein
one or more of said functions comprises a linear function 
which linearly relates the input parameters to the 
function to the output parameters of the function. 

6. A model according to claim 5, wherein said one or 
more linear functions are identified from a principal 
component analysis of training data derived from a set of 
training objects. 

7. A model according to any preceding claim, wherein 
one or more of said functions are non-linear. 

8. A model according to claim 7, wherein at least one
of said non-linear functions comprises a neural network.

9. A model according to any preceding claim, wherein
the number of parameters input to at least one of said
functions is smaller than the number of parameters output
from the function.

10. A model according to any preceding claim, wherein
the number of input parameters to at least one of said
functions is greater than or equal to the number of
parameters output by the function.

11. A model according to any preceding claim for
modelling the shape and texture of the object, the model
further comprising data defining a hierarchical set of
functions in which a function in a top layer of the
hierarchy is operable to generate a set of output
parameters from a set of input parameters and in which
one or more functions in a bottom layer of the hierarchy
are operable to receive parameters output from one or
more functions from a higher layer of the hierarchy and
to generate therefrom texture information for the object.

12. A model according to claim 11, wherein the texture 
hierarchy has the same structure as the shape hierarchy. 

13. A model according to claim 11 or 12, wherein one or 
more of said functions are operable to relate an input 
set of shape and texture parameters to an output set of 
appearance parameters defining both shape and texture. 

14. A model according to any preceding claim, wherein 
said object is a deformable object. 

15. A model according to claim 14, wherein said 
deformable object includes a human face. 

16. A model according to claim 15, wherein said function 
in said top layer of the hierarchy models the shape of 
the entire face and wherein said hierarchy includes a 
function which models the shape of the mouth. 

17. A model according to claim 16, wherein said 
hierarchy further comprises a function for modelling the 
shape of the eyes. 

18. A model according to any preceding claim, wherein 
the or each function in the bottom layer of the hierarchy 
identifies the positions of a plurality of predetermined 
points according to a predefined function of a smaller 
number of control point positions. 

19. A model according to claim 18, wherein the 
predefined function for each of the plurality of points 
is a linear mapping of the control point positions and 
the control points are the three corners of a triangular 
facet. 

20. A model according to claim 18, wherein the 
predefined function for each of the plurality of points 
is a predefined non-linear mapping of a fixed number of 
control point positions. 

21. A model according to claim 18, wherein the 
predefined function for each of the plurality of points 
is a predefined displacement from a single control point. 

22. A method of determining a set of appearance 
parameters representative of the appearance of an object, 
the method comprising the steps of: 

(i) storing a parametric model according to any of 
claims 1 to 21 which relates a set of input parameters to 
appearance data representative of the appearance of the 
object; 

(ii) storing at least one function which relates a 
change in the input parameters to an error between actual 
appearance data for the object and appearance data 
determined from the set of input parameters and said 
parametric model; 

(iii) initially estimating a current set of input 
parameters for the object; 

(iv) determining appearance data for the object from 
the current set of input parameters and the stored 
parametric model; 

(v) determining the error between actual appearance 
data of the object and the appearance data determined 
from the current set of input parameters; 

(vi) determining a change in the input parameters 
using said at least one stored function and said 
determined error; and 

(vii) updating the current set of input parameters 
with the determined change in the input parameters. 



23. A method according to claim 22, further comprising 
the step of repeating steps (iv) to (vii) until the error 
determined in step (v) is less than a predetermined 
threshold. 

24. A method according to claim 22, further comprising 
the step of repeating steps (iv) to (vii) for a 
predetermined amount of time or for a predetermined 
number of repetitions. 

25. A method according to claim 22, 23 or 24, wherein 
said second storing step stores a plurality of functions, 
one associated with each function within the hierarchical 
model. 

26. A method of tracking an object comprising the steps 
of: 

(i) storing a parametric model according to any of 
claims 1 to 21 which relates a set of input parameters to 
appearance data representative of the appearance of the 
object; 

(ii) storing at least one function which relates a 
change in the input parameters to an error between the 
actual appearance data for the object and the appearance 
data determined from the set of input parameters and said 
parametric model; 

(iii) initially estimating a current set of input 
parameters for the object; 

(iv) determining the appearance data for the object 
from the current set of input parameters and the stored 
parametric model; 

(v) determining an error between the actual 
appearance data for the object and the appearance data 
for the object determined from the current set of input 
parameters; 

(vi) determining a change in the input parameters 
using the at least one stored function and the determined 
error; 

(vii) updating the current set of input parameters 
with said change in the input parameters; 

(viii) repeating steps (iv) to (vii) in order to 
reduce the error determined in step (v); and 

(ix) repeating steps (iii) to (viii) to track the 
object. 

27. An apparatus for determining a set of appearance 
parameters representative of the appearance of an object, 
the apparatus comprising: 

means for storing (i) a parametric model according 
to any of claims 1 to 21 which relates a set of input 
parameters to appearance data representative of the 
appearance of the object; and (ii) at least one function 
which relates a change in the input parameters to an 
error between actual appearance data for the object and 
the appearance data for the object determined from the 
set of input parameters and said parametric model; 

means for receiving an initial estimate of a current 
set of input parameters for the object; 

means for updating the current set of input 
parameters comprising: 

(i) means for determining appearance data for the 
object from the current set of input parameters and the 
stored parametric model; 

(ii) means for determining the error between the 
actual appearance data for the object and the appearance 
data for the object determined from the current set of 
input parameters; 

(iii) means for determining a change in the input 
parameters using said at least one stored function and 
said determined error; and 

(iv) means for updating the current set of input 
parameters with the determined change in the input 
parameters. 

28. An apparatus according to claim 27, wherein said 
updating means is operable to update iteratively the 
current set of input parameters until the error 
determining means determines an error which is less than 
a predetermined threshold. 

29. An apparatus according to claim 27 or 28, wherein 
said storing means stores a plurality of functions, one 
associated with each function within the hierarchical 
model. 

30. An apparatus for tracking an object comprising: 

means for storing (i) a parametric model according 
to any of claims 1 to 21 which relates a set of input 
parameters to appearance data representative of the 
appearance of the object; and (ii) at least one function 
which relates a change in the input parameters to an 
error between actual appearance data for the object and 
the appearance data for the object determined from the 
set of input parameters and said parametric model; 

means for receiving an initial estimate of a current 
set of input parameters for the object; 

means for updating the current set of input 
parameters comprising: 

(i) means for determining appearance data for the 
object from the current set of input parameters and the 
stored parametric model; 

(ii) means for determining an error between actual 
appearance data for the object and the appearance data 
for the object determined from the current set of input 
parameters; 

(iii) means for determining a change in the input 
parameters using the at least one stored function and the 
determined error; and 

(iv) means for updating the current set of input 
parameters with said change in the input parameters; 

wherein said updating means is operable to update 
iteratively the current set of input parameters in order 
to reduce the determined error, wherein said receiving 
means is operable to receive further estimates of the 
current input parameters and wherein said updating means 
is operable to update the received estimates of the 
current input parameters in order to track said object. 

31. A storage medium storing the parametric model 
according to any of claims 1 to 21 or storing processor 
implementable instructions for controlling a processor to 
implement the method of any one of claims 22 to 26. 

32. Processor implementable instructions for controlling 
a processor to implement the method of any one of claims 
22 to 26. 

IMAGE PROCESSING SYSTEM 

The present invention relates to the parametric modelling 
of the appearance of objects. The resulting model can be 
used, for example, to track the object, such as a human 
face, in a video sequence. 



The use of parametric models for image interpretation and 
synthesis has become increasingly popular. Cootes et al 
have shown in their paper entitled "Active Shape Models - 
Their Training and Application", Computer Vision and 
Image Understanding, Volume 61, No. 1, January, pages 
38-59, 1995, how such parametric models can be used to model 
the variability of the shape and texture of human faces. 
They have mainly used these models for face recognition 
and tracking within video sequences, although they have 
also demonstrated that their model can be used to model 
the variability of other deformable objects, such as MRI 
scans of knee joints. The use of these models provides 
a basis for a broad range of applications since they 
explain the appearance of a given image in terms of a 
compact set of model parameters which can be used for 
higher levels of interpretation of the image. For 
example, when analysing face images, they can be used to 
characterise the identity, pose or expression of a face. 



Using such models for image interpretation requires, 
however, a method of fitting them to new image data. 
This involves identifying the model parameters that 
generate an image which best fits (according to some 
measure) the new input image. Typically this problem is 
one of minimising the sum of squares of pixel errors 
between the generated image and the input image. In 
their paper entitled "Estimating Coloured 3D Face Models 
from Single Images: An Example-Based Approach", Vetter and 
Blanz have proposed a stochastic gradient descent 
optimisation technique to identify the optimum model 
parameters for the new image. Although this technique 
can find a locally optimal solution very accurately, it 
generally gets stuck in local minima, since the error 
surface for the problem of fitting an appearance model to 
an image is particularly rough and contains many local 
minima. Therefore, this minimisation technique often 
fails to converge on the global minimum. An additional 
drawback of this technique is that it is very slow, 
requiring several minutes to achieve convergence. 

A faster, more robust technique known as the active 
appearance model was proposed by Edwards et al in the 
paper entitled "Interpreting Face Images using Active 
Appearance Models", published in the Third International 
Conference on Automatic Face and Gesture Recognition 
1998, pages 300-305, Japan, April 1998. This technique 
uses a prior training stage in which the relationship 
between model parameter displacements and the resulting 
change in image error is learnt. Although the method is 
much faster than direct optimisation techniques, it 
requires fairly accurate initial model parameters if the 
search is to converge. Additionally, this technique does 
not guarantee that the optimum parameters will be found. 

The appearance model proposed by Cootes et al includes a 
single appearance model matrix which linearly relates a 
set of parameters to corresponding image data. Blanz et 
al segmented the face into a number of completely 
independent appearance models, each of which is used to 
render a separate region of the face. The results are 
then merged using a general interpretation technique. 

The present invention aims to provide an alternative way 
of modelling the appearance of objects which will allow 
subsequent image interpretation through appropriate 
processing of parameters generated for the image. 

According to one aspect, the present invention provides 
a hierarchical parametric model for modelling the shape 
of an object, the model comprising data defining a 
hierarchical set of functions in which a function in a 
top layer of the hierarchy is operable to generate a set 
of output parameters from a set of input parameters and 
in which one or more functions in a bottom layer of the 
hierarchy are operable to receive parameters output from 
one or more functions from a higher layer of the 
hierarchy and to generate therefrom the relative 
positions of a plurality of predetermined points on the 
object. Such a hierarchical parametric model has the 
advantage that small changes in some parts of the object 
can still be modelled by the parameters, even though they 
are significantly smaller than variations in other less 
important parts of the object. This model can be used 
for face tracking, video compression, 2D and 3D character 
generation, face recognition for security purposes, image 
editing etc. 

According to another aspect, the present invention 
provides an apparatus and method of determining a set of 
appearance parameters representative of the appearance of 
an object, the method comprising the steps of storing a 
hierarchical parametric model such as the one discussed 
above and at least one function which relates a change in 
input parameters to an error between actual appearance 
data for the object and appearance data determined from 
the set of input parameters and the parametric model; 
initially receiving a current set of input parameters for 
the object; determining appearance data for the object 
from the current set of input parameters and the stored 
parametric model; determining the error between the 
actual appearance data of the object and the appearance 
data determined from the current set of input parameters; 
determining a change in the input parameters using the at 
least one stored function and said determined error; and 
updating the current set of input parameters with the 
determined change in the input parameters. 
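
The update steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a toy linear function (mean plus a matrix P times the parameters), and the stored function is a hypothetical matrix A that maps the measured error directly to a parameter change. The names generate, fit, mean, P and A, the threshold and the iteration limit are all invented for the example and are not taken from the specification.

```python
# Minimal sketch of the iterative parameter update, assuming a toy
# linear model and a hypothetical stored update matrix A.

def generate(c, mean, P):
    """Appearance data from parameters c: mean + P c."""
    return [m + sum(p * ci for p, ci in zip(row, c)) for m, row in zip(mean, P)]

def fit(actual, c, mean, P, A, threshold=1e-9, max_iters=20):
    """Refine c until the squared error drops below the threshold."""
    for _ in range(max_iters):
        model = generate(c, mean, P)                    # appearance from current c
        error = [a - g for a, g in zip(actual, model)]  # actual minus generated
        if sum(e * e for e in error) < threshold:
            break
        # Stored function: parameter change predicted from the error.
        dc = [sum(a * e for a, e in zip(row, error)) for row in A]
        c = [ci + d for ci, d in zip(c, dc)]            # update the parameters
    return c

mean = [1.0, 2.0, 3.0]
P = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # orthonormal columns (toy choice)
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # for this toy model, the transpose of P
actual = generate([0.5, -0.25], mean, P)  # synthetic "observed" appearance data
fitted = fit(actual, [0.0, 0.0], mean, P, A)
```

Because the toy matrix P has orthonormal columns, a single update recovers the parameters exactly; for a real appearance model the stored function would only approximate the required change, which is why the steps are repeated.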

An exemplary embodiment of the present invention will now 
be described with reference to the accompanying drawings 
in which: 



Figure 1 is a schematic block diagram illustrating a 
general arrangement of a computer system which can be 
programmed to implement the present invention; 

Figure 2 is a block diagram of an appearance model 
generation unit which receives some of the image frames 
of a source video sequence together with a target image 
frame and generates therefrom an appearance model; 

Figure 3 is a block diagram of a target video sequence 
generation unit which generates a target video sequence 
from a source video sequence using a set of stored 
difference parameters; 

Figure 4 is a flow chart illustrating the processing 
steps which the target video sequence generation unit 
shown in Figure 3 performs to generate the target video 
sequence; 

Figure 5 schematically illustrates the form of a 
hierarchical appearance model generated in one embodiment 
of the invention; 

Figure 6 shows a head with a mesh of triangular facets 
placed over the head and whose positions are defined by 
the position of landmark points at the corners of the 
facets; 

Figure 7 is a flow chart illustrating the processing 
steps required to generate a facet appearance model from 
the training images; 

Figure 8 schematically illustrates the way in which a 
transformation is defined between a facet in a training 
image and a predefined shape of facet which allows 
texture information to be extracted from the facet; 

Figure 9 is a flow chart illustrating the main processing 
steps involved in determining an appearance model for the 
mouth using the appearance models for the facets which 
appear in the mouth and using the training images; 

Figure 10 schematically illustrates the way in which 
training images are used to determine some of the 
appearance models which form the hierarchical appearance 
model illustrated in Figure 5; 

Figure 11a is a flow chart illustrating the processing 
steps performed during a training routine to identify an 
Active matrix associated with a current facet; 

Figure 11b is a flow chart illustrating the processing 
steps performed during a training routine to identify an 
Active matrix associated with the mouth; 

Figure 12 is a flow chart illustrating the processing 
steps involved in determining a set of parameters which 
define the appearance of a face within an input image; 

Figure 13a shows three frames of an example source video 
sequence which is applied to the target video sequence 
generation unit shown in Figure 3; 

Figure 13b shows an example target image used to generate 
a set of difference parameters used by the target video 
sequence generation unit shown in Figure 3; 

Figure 13c shows a corresponding three frames from a 
target video sequence generated by the target video 
sequence generation unit shown in Figure 3 from the three 
frames of the source video sequence shown in Figure 13a 
using the difference parameters generated using the 
target image shown in Figure 13b; 

Figure 13d shows a second example of a target image used 
to generate a set of difference parameters for use by the 
target video sequence generation unit shown in Figure 3; 
and 

Figure 13e shows the corresponding three frames from the 
target video sequence generated by the target video 
sequence generation unit shown in Figure 3 when the three 
frames of the source video sequence shown in Figure 13a 
are input to the target video sequence generation unit 
together with the difference parameters calculated using 
the target image shown in Figure 13d. 

Figure 1 shows an image processing apparatus according to an 
embodiment of the present invention. The apparatus 
comprises a computer 1 having a central processing unit 
(CPU) 3 connected to a memory 5 which is operable to 
store a program defining the sequence of operations of 
the CPU 3 and to store object and image data used in 
calculations by the CPU 3. Coupled to an input port of 
the CPU 3 there is an input device 7, which in this 
embodiment comprises a keyboard and a computer mouse. 
Instead of, or in addition to, the computer mouse, another 
position sensitive input device (pointing device) such as 
a digitiser with associated stylus may be used. 

A frame buffer 9 is also provided and is coupled to the 
CPU 3 and comprises a memory unit (not shown) arranged to 
store image data relating to at least one image, for 
example by providing one (or several) memory location(s) 
per pixel of the image. The value stored in the frame 
buffer for each pixel defines the colour or intensity of 
that pixel in the image. In this embodiment, the images 
are represented by 2-D arrays of pixels, and are 
conveniently described in terms of Cartesian coordinates, 
so that the position of a given pixel can be described by 
a pair of x-y coordinates. This representation is 
convenient since the image is displayed on a raster scan 
display 11. Therefore, the x-coordinate maps to the 
distance along the line of the display and the 
y-coordinate maps to the number of the line. The frame 
buffer 9 has sufficient memory capacity to store at least 
one image. For example, for an image having a resolution 
of 1000 x 1000 pixels, the frame buffer 9 includes 10^6 
pixel locations, each addressable directly or indirectly 
in terms of a pixel coordinate x,y. 
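
As a small illustration of addressing a pixel location by its x,y coordinate, the sketch below assumes a row-major layout in which each display line is stored contiguously. The layout, the list-based buffer and the name pixel_index are assumptions made for the example; the text above does not specify how the frame buffer is organised internally.

```python
# Sketch of row-major pixel addressing; the layout is an assumption.
WIDTH, HEIGHT = 1000, 1000

def pixel_index(x, y, width=WIDTH):
    """Map pixel coordinate (x, y) to a linear frame buffer location."""
    return y * width + x

frame_buffer = [0] * (WIDTH * HEIGHT)   # 10**6 pixel locations
frame_buffer[pixel_index(3, 2)] = 255   # pixel at distance x=3 along line y=2
```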



In this embodiment, a video tape recorder (VTR) 13 is 
also coupled to the frame buffer 9, for recording the 
image or sequence of images displayed on the display 11. 
A mass storage device 15, such as a hard disc drive, 
having a high data storage capacity is also provided and 
coupled to the memory 5. Also coupled to the memory 5 is 
a floppy disc drive 17 which is operable to accept 
removable data storage media, such as a floppy disc 19, 
and to transfer data stored thereon to the memory 5. The 
memory 5 is also coupled to a printer 21 so that 
generated images can be output in paper form, an image 
input device 23 such as a scanner or video camera and a 
modem 25 so that input images and output images can be 
received from and transmitted to remote computer 
terminals via a data network, such as the Internet. The 
CPU 3, memory 5, frame buffer 9, display unit 11 and mass 
storage device 15 may be commercially available as a 
complete system, for example as an IBM compatible 
personal computer (PC) or a workstation such as the 
SPARCstation available from Sun Microsystems. 

A number of embodiments of the invention can be supplied 
commercially in the form of programs stored on a floppy 
disc 19 or on other media, or as signals transmitted 
over a data link, such as the Internet, so that the 
receiving hardware becomes reconfigured into an apparatus 
embodying the present invention. 

In this embodiment, the computer 1 is programmed to 
receive a source video sequence input by the image input 
device 23 and to generate a target video sequence from 
the source video sequence using a target image. In this 
embodiment, the source video sequence is a video clip of 
an actor acting out a scene, the target image is an image 
of a second actor and the resulting target video sequence 
is a video sequence showing the second actor acting out 
the scene. The way in which this is achieved will now be 
briefly described with reference to Figures 2 to 4. 

In this embodiment, in order to generate the target video 
sequence from the source video sequence, a hierarchical 
parametric appearance model which models the variability 
of shape and texture of the head images is used. This 
appearance model makes use of the fact that some prior 
knowledge is available about the contents of head images 
in order to facilitate their modelling. For example, it 
can be assumed that two frontal images of a human face 
will each include eyes, a nose and a mouth. In this 
embodiment, as shown in Figure 2, the hierarchical 
parametric appearance model 35 is generated by an 
appearance model generation unit 31 from training images 
which are stored in an image database 32. In this 
embodiment, all the training images are colour images 
having 500 x 500 pixels, with each pixel having a red, 
a green and a blue pixel value. The resulting appearance 
model 35 is a parameterisation of the appearance of the 
class of head images defined by the heads in the training 
images, so that a relatively small number of parameters 
(for example 50) can describe the detailed (pixel level) 
appearance of a head image from the class. In 
particular, the hierarchical appearance model 35 defines 
a function (F) such that: 

I = F(c) (1) 

where c is the set of appearance parameters (written in 
vector notation) which generates, through the 
hierarchical appearance model (F), the face image I. The 
structure of the hierarchical appearance model used in 
this embodiment will be described later. 



Once the hierarchical appearance model 35 has been 
determined, a target video sequence can be generated from 
a source video sequence. As shown in Figure 3, the 
source video sequence is input to a target video sequence 
generation unit 51 which processes the source video 
sequence using a set of difference parameters 53 to 
generate and to output the target video sequence. The 
difference parameters 53 are determined by subtracting 
the appearance parameters which are generated for the 
first actor's head in one of the source video frames, 
from the appearance parameters which are generated for 
the second actor's head in the target image. The way in 
which these appearance parameters are determined for 
these images will be described later. In order that 
these difference parameters only represent differences in 
the general shape and colour texture of the two actors' 
heads, the pose and facial expression of the first 
actor's head in the source video frame used should match, 
as closely as possible, the pose and facial expression of 
the second actor's head in the target image. 



The processing steps required to generate the target 
video sequence from the source video sequence will now be 
described in more detail with reference to Figure 4. As 
shown, in step s1, the appearance parameters (c_s^i) for 
the first actor's head in the current video frame (V) 
are automatically calculated. The way that this is 
achieved will be described later. Then, in step s3, the 
difference parameters (c_dif) are added to the appearance 
parameters for the first actor's head in the current 
video frame to generate: 

c_mod^i = c_s^i + c_dif (2) 

The resulting appearance parameters (c_mod^i) are then used, 
in step s5, to regenerate the head for the current target 
video frame. In particular, the modified appearance 
parameters are inserted into equation (1) above to 
regenerate a modified head image which is then 
composited, in step s7, into the source video frame to 
generate the corresponding target video frame. A check 
is then made, in step s9, to determine whether or not 
there are any more source video frames. If there are, 
then the processing returns to step s1 where the 
procedure described above is repeated for the next source 
video frame. If there are no more source video frames, 
then the processing ends. 
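
The parameter arithmetic in the steps above can be illustrated with a short Python sketch. It assumes the appearance parameters for each frame have already been calculated and are held as plain lists of numbers; the function names and the numeric values are invented for the example. Regenerating and compositing the head image (steps s5 and s7) are not shown.

```python
# Sketch of the difference-parameter arithmetic, with made-up values.

def difference_parameters(target_params, source_ref_params):
    """Target image parameters minus those of the chosen source frame."""
    return [t - s for t, s in zip(target_params, source_ref_params)]

def modified_parameters(source_frame_params, diff):
    """Add the difference parameters to the parameters of every frame."""
    return [[c + d for c, d in zip(frame, diff)] for frame in source_frame_params]

source_frames = [[1.0, 0.0], [1.5, 0.25], [2.0, 0.5]]  # first actor, per frame
target_image = [3.0, -1.0]                             # second actor
diff = difference_parameters(target_image, source_frames[0])
modified = modified_parameters(source_frames, diff)
```

In this toy run the first source frame is the one used to compute the difference parameters, so its modified parameters equal those of the target image, while later frames carry over the frame-to-frame changes of the source sequence.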



Figure 13 illustrates the results of this animation 
technique (although showing black and white images and 
not colour). In particular, Figure 13a shows three 
frames of the source video sequence, Figure 13b shows the 
target image (which in this embodiment is computer 
generated) and Figure 13c shows the corresponding three 
frames of the target video sequence obtained in the 
manner described above. As can be seen, an animated 
sequence of the computer generated character has been 
generated from a video clip of a real person and a single 
image of the computer generated character. 

HIERARCHICAL APPEARANCE MODEL 

In the systems described by Cootes et al and Blanz et al, 
the parametric model is created by placing a number of 
landmark points on a training image and then identifying 
the same landmark points on the other training images in 
order to identify how the location of and the pixel 
values around the landmark points vary within the 
training images. A principal component analysis is then 
performed on the matrix which consists of vectors of the 
landmark points. This PCA yields a set of eigenvectors 
which describe the directions of greatest variation along 
which the landmark points change. Their appearance model 
includes the linear combination of the eigenvectors plus 
parameters for translation, rotation and scaling. This 
single appearance model relates a compact set of 
appearance parameters to pixel values. 
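
To make the principal component analysis step concrete, the sketch below finds the main direction of variation of a few toy landmark vectors. It is an illustration only: each shape is flattened to a list (x1, y1, x2, y2), the data are invented, and the dominant eigenvector of the covariance matrix is found by simple power iteration rather than by a full eigen-decomposition. All function names are hypothetical.

```python
import math

# Sketch of PCA on landmark vectors using power iteration (toy data).

def mean_vector(samples):
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def covariance(samples, mean):
    d, n = len(mean), len(samples)
    cov = [[0.0] * d for _ in range(d)]
    for s in samples:
        diff = [s[k] - mean[k] for k in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / n
    return cov

def first_eigenvector(cov, iters=100):
    """Dominant eigenvector of a symmetric matrix by power iteration.
    Assumes the matrix is non-zero and the start vector is not
    orthogonal to the dominant eigenvector."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two landmarks per shape; both move vertically together across the set.
shapes = [[0.0, 0.0, 2.0, 0.0], [0.0, 1.0, 2.0, 1.0], [0.0, 2.0, 2.0, 2.0]]
mean = mean_vector(shapes)
mode = first_eigenvector(covariance(shapes, mean))
# A single parameter b then describes a shape: shape = mean + b * mode.
b = sum(m * (x - mu) for m, x, mu in zip(mode, shapes[2], mean))
```

Here the recovered mode moves both landmarks vertically by equal amounts, which is the only variation present in the toy data.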

In this embodiment, rather than having a single 
appearance model for the object, a hierarchical 
appearance model comprising several appearance models 
which model variations in components of the object is 
used. For example, in the case of human faces, the 
hierarchical appearance model may include an appearance 
model for the mouth, one for the left eye, one for the 
right eye and one for the nose. Since it may be possible 
to model various components of the object, the particular 
hierarchical structure which will be used for a 
particular object and application must first of all be 
defined by the system designer. 



Figure 5 schematically illustrates the structure of the 
hierarchical appearance model used in this embodiment. 
As shown, at the top of the hierarchy there is a general 
face appearance model 61. Beneath the face appearance 
model there is a mouth appearance model 63, a left eye 
appearance model 65, a right eye appearance model 67, a 
left eyebrow appearance model 69, a rest of left eye 
appearance model 71, a right eyebrow appearance model 73, 
a rest of right eye appearance model 75 and, in this 
embodiment, a facet appearance model for each facet 
defined in the training images. Figure 6 shows the head 
of a training image in which the set of landmark points 
has been placed at the appropriate points on the head. 
As shown, in this embodiment, there are one hundred and 
forty-eight triangular areas or facets defined by the 
positions of the landmark points. Therefore, in this 
embodiment, there are one hundred and forty-eight facet 
appearance models 77. 

The face appearance model 61 operates to relate a small 
number of "global" appearance parameters to a further set 
of appearance parameters, some of which are input to 
facet appearance models 77, some of which are input to 
the mouth appearance model 63, some of which are input to 
the left eye appearance model 65 and the rest of which 
are input to the right eye appearance model 67. Each 
facet appearance model 77 operates to relate the input 
parameters received from the appearance model which is 
above it in the hierarchy to corresponding pixel values 
for that facet. The mouth appearance model 63 is 
operable to relate the parameters it receives from the 
face appearance model 61 to a further set of appearance 
parameters, respective ones of which are output to the 
respective facet appearance models 77 for the facets 
which are associated with the mouth. Similarly, the left 
and right eye appearance models 65 and 67 each operate to 
relate the parameters received from the face 
appearance model 61 to a further set of appearance 
parameters, some of which are input to the appropriate 
eyebrow appearance model and the rest of which are input 
to the appropriate rest of eye appearance model. These 
appearance models in turn convert these parameters into 
parameters for input to the facet appearance models 
associated with the facets which appear in the left and 
right eyes respectively. In this way, a small compact 
set of "global" appearance parameters input to the face 
appearance model 61 can filter through the hierarchical 
structure illustrated in Figure 5 to generate a set of 
pixel values for all the facets in a head which can then 
be used to regenerate the image of the head. 
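
The way parameters filter down through the hierarchy can be sketched with a toy example. The sketch assumes every appearance model is a small linear map, uses one shared facet matrix for brevity (in the embodiment each facet has its own model), and has only a face model, a mouth model and three facets. The matrices, their sizes and the names linear, FACE, MOUTH, FACET and render are all invented for illustration.

```python
# Toy hierarchy: face -> (mouth, one other facet); mouth -> two facets.
# All matrices are hypothetical and chosen only to keep the numbers simple.

def linear(matrix, params):
    """Apply a linear appearance model: output = matrix * params."""
    return [sum(m * p for m, p in zip(row, params)) for row in matrix]

FACE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 global params -> 3 params
MOUTH = [[2.0, 0.0], [0.0, 2.0]]             # 2 params -> 1 param per mouth facet
FACET = [[1.0], [0.5]]                       # 1 param -> 2 pixel values

def render(global_params):
    face_out = linear(FACE, global_params)
    mouth_in, other_in = face_out[:2], face_out[2:]  # split the face outputs
    mouth_out = linear(MOUTH, mouth_in)              # mouth model feeds its facets
    facet_inputs = [[p] for p in mouth_out] + [other_in]
    return [linear(FACET, f) for f in facet_inputs]  # pixel values per facet

pixels = render([1.0, 2.0])
```

Two global parameters are thus expanded, layer by layer, into pixel values for every facet, which mirrors the top-down flow described for Figure 5.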

The way in which the individual appearance models of this 
hierarchical appearance model are generated in this 
embodiment will now be described with reference to 
Figures 6 to 10. 

In this embodiment, each of the training images stored in 
the image database 32 is labelled with eighty-six 
landmark points. In this embodiment, this is performed 
manually by the user via the user interface 33. In 
particular, each training image is displayed on the 
display 11 and the user places the landmark points over 
the head in the training image. These points delineate 
the main features in the head, such as the position of 
the hairline, neck, eyes, nose, ears and mouth. In order 
to compare training faces, each landmark point is 
associated with the same point on each face. In this 
embodiment, the following landmark points are used: 



Landmark  Associated Position               Landmark  Associated Position
Point                                       Point
--------  --------------------------------  --------  ------------------------
LP1       Left corner of left eye           LP44      Eye, bottom
LP2       Right corner of right eye         LP45      Eye, top
LP3       Chin, bottom                      LP46      Eye, bottom
LP4       Right corner of left eye          LP47      Eyebrow, lower
LP5       [illegible] eye                   LP48      Eyebrow, upper
LP6       Mouth, left                       LP49      Cheek, left
LP7       Mouth, right                      LP50      Cheek, right
LP8       [illegible]                       LP51      Eyebrow, lower
LP9       Nose between eyes                 LP52      Eyebrow, upper
LP10      Upper lip, top                    LP53      Eyebrow, lower
LP11      Lower lip, bottom                 LP54      Eyebrow, upper
LP12      Neck, left, top                   LP55      Eyebrow, lower
LP13      Neck, right, top                  LP56      Eyebrow, upper
LP14      Face edge left, level with nose   LP57      Eyebrow, lower
LP15      Face edge [illegible]             LP58      Eyebrow, upper
LP16      [illegible] with nose             LP59      Eyebrow, lower
LP17      Face edge                         LP60      Eyebrow, upper
LP18      Top of head                       LP61      Eyebrow, lower
LP19      Hair edge                         LP62      Lower lip, top
LP20      Hair edge                         LP63      Centre forehead
LP21      Hair edge                         LP64      Upper lip, top left
LP22      Hair edge                         LP65      Upper lip, top right
LP23      Hair edge                         LP66      Lower lip, bottom right
LP24      Hair edge                         LP67      Lower lip, bottom left
LP25      Hair edge                         LP68      Eye, top left
LP26      Hair edge                         LP69      Eye, top right
LP27      Hair edge                         LP70      Eye, bottom right
LP28      Hair edge                         LP71      Eye, bottom left
LP29      Bottom, far left                  LP72      Eye, top left
LP30      Bottom, far right                 LP73      Eye, top right
LP31      Shoulder                          LP74      Eye, bottom right
LP32      Shoulder                          LP75      Eye, bottom left
LP33      Bottom, left                      LP76      Lower lip, top left
LP34      Bottom, middle                    LP77      Lower lip, top right
LP35      Bottom, right                     LP78      Chin, left
LP36      Left forehead                     LP79      Chin, right
LP37      [illegible]                       LP80      Neck, left
LP38      Centre, between eyebrows          LP81      Neckline, left
LP39      Nose, left                        LP82      Neckline
LP40      Nose, right                       LP83      Neckline, right
LP41      Nose edge, left                   LP84      Neck, right
LP42      Nose edge, right                  LP85      Hair edge
LP43      Eye, top                          LP86      Hair edge



The result of the manual placement of the landmark points 
is a table of landmark points for each training image, 
which identifies the (x, y) coordinate of each landmark 
point within the image. As shown in Figure 6, these 
landmark points are also used to define the location of 
predetermined triangular facets or areas within the 
training image. 



FACET APPEARANCE MODEL 

Figure 7 shows a flow chart illustrating the main 
processing steps involved in this embodiment in 
determining a facet appearance model for facet (i). As 
shown, in step s61, the system determines, for each 
training image, the apex coordinates of facet (i) and 
texture values from within facet (i). In order to sample 
texture from within the facet at corresponding points 
within each training facet, a transformation which 
transforms the facet onto a reference facet is 
determined. Figure 8 illustrates this transformation. 
In particular, Figure 8 shows facet $f_i^v$ taken from the 
V-th training image, which is defined by the landmark 
points $(x_1^v, y_1^v)$, $(x_2^v, y_2^v)$ and $(x_3^v, y_3^v)$. The 
transformation ($T_i^v$) which transforms those coordinates 
onto coordinates (0,0), (1,0) and (0,1) is determined. 
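
The mapping between a training facet and the reference facet can be sketched as follows. This is a minimal illustration which assumes that the transformation is affine; the function names are hypothetical and are not taken from the embodiment.

```python
def reference_to_image(tri, u, v):
    """Inverse transformation: reference coordinates (u, v) -> image point.

    tri holds the facet apexes ((x1, y1), (x2, y2), (x3, y3)), which
    correspond to the reference points (0,0), (1,0) and (0,1).
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    x = x1 + u * (x2 - x1) + v * (x3 - x1)
    y = y1 + u * (y2 - y1) + v * (y3 - y1)
    return x, y


def image_to_reference(tri, x, y):
    """Forward transformation: image point -> reference coordinates."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    if det == 0:
        raise ValueError("degenerate facet: apexes are collinear")
    u = ((x - x1) * (y3 - y1) - (x3 - x1) * (y - y1)) / det
    v = ((x2 - x1) * (y - y1) - (x - x1) * (y2 - y1)) / det
    return u, v


# Hypothetical facet whose apexes map onto (0,0), (1,0) and (0,1).
tri = ((10.0, 20.0), (30.0, 20.0), (10.0, 60.0))
```

For the hypothetical facet above, the second apex (30, 20) maps to the reference point (1, 0), and the reference point (0.5, 0.25) maps back to the image location (20, 30).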

In this embodiment, the texture information extracted 
from each training facet is defined by the regular array 
of pixels shown in the reference facet. In order to 
determine the corresponding red, green and blue pixel 
values in the training image, the inverse transformation 
($[T_i^v]^{-1}$) is used to transform the pixel locations in the 
reference facet into corresponding locations in the 
training facet, from which the RGB pixel values are 
determined. In this embodiment, this transformation may 
not result in an exact correspondence with a single image 
pixel location, since the pixel resolution in the actual 
facet may be different from the resolution in the reference 
facet. In this embodiment, the texture information (RGB 
pixel values) which is determined is obtained by 
interpolating between the surrounding image RGB pixel 
values. In this embodiment, there are fifty pixels in 
the regular array of pixels in the reference facet. 
Therefore, fifty RGB pixel values are extracted for each 
training facet. The texture information for facet (i) 
from the V-th training image can then be represented by 
a vector ($t^{iv}$) of the form: 

    $$t^{iv} = [t_1^{iv}, t_2^{iv}, t_3^{iv}, \ldots, t_{50}^{iv}]^T$$

where $t_1^{iv}$ is the RGB texture information for the first 
reference pixel extracted from facet (i) in the V-th 
training image, etc. 
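
The interpolation step can be illustrated by the following sketch. It assumes bilinear interpolation between the four surrounding pixels, which is only one possible choice since the embodiment does not name a particular scheme; the function name and the image layout are hypothetical.

```python
import math


def sample_bilinear(image, x, y):
    """Interpolate an RGB value at the non-integer location (x, y).

    image is a list of rows and image[row][col] is an (r, g, b) tuple,
    so x indexes columns and y indexes rows. Locations are clamped so
    that the four neighbouring pixels lie inside the image.
    """
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(3):
        top = (1 - fx) * image[y0][x0][c] + fx * image[y0][x1][c]
        bottom = (1 - fx) * image[y1][x0][c] + fx * image[y1][x1][c]
        out.append((1 - fy) * top + fy * bottom)
    return tuple(out)


# Hypothetical 2x2 image used only to exercise the function.
image = [[(0, 0, 0), (100, 0, 0)],
         [(0, 100, 0), (100, 100, 0)]]
centre = sample_bilinear(image, 0.5, 0.5)
```

Applying this function at each of the fifty reference pixel locations, after mapping them into the training facet, would yield the fifty RGB values of the texture vector.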

In this embodiment, the facet appearance models 77 treat 
shape and texture separately. Therefore, in step s63, 
the system performs a principal component analysis (PCA) 
on the set of texture training vectors generated in step 
s61. For a more detailed discussion of principal 
component analysis, the reader is referred to the book by 
W. J. Krzanowski entitled "Principles of Multivariate 
Analysis - A User's Perspective" 1998, Oxford Statistical 
Science Series. As those skilled in the art will 
appreciate, this principal component analysis determines 
all possible modes of variation within the training 
texture vectors. However, since each of the facets is 
associated with a similar point on the face, most of the 
variation within the data can be explained by a few modes 
of variation. The result of the principal component 
analysis is a facet texture appearance model (defined by 
matrix $F_i$) which relates a vector of facet texture 
parameters to a vector of texture pixel values, by: 

    $$p_v^{F_i t} = F_i (t^{iv} - \bar{t}^i) \qquad (3)$$

where $t^{iv}$ is the RGB texture vector defined above, $\bar{t}^i$ is 
the mean RGB texture vector for facet (i), $F_i$ is a matrix 
which defines the facet texture appearance model for 
facet (i) and $p_v^{F_i t}$ is a vector of the facet texture 
parameters which describes the RGB texture vector $t^{iv}$. 
The matrix $F_i$ describes the main modes of variation of 
the texture within the training facets; and the vector of 
facet texture parameters ($p_v^{F_i t}$) for a given input facet 
has a parameter associated with each mode of variation 
whose value relates the texture of the input facet to the 
corresponding mode of variation. 
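
A model of the form of equation (3) can be sketched with NumPy as shown below. This is an illustrative sketch only: it assumes that the principal component analysis is computed by a singular value decomposition of the mean-centred training vectors, and the function names and the random training data are hypothetical.

```python
import numpy as np


def fit_texture_model(textures, n_modes):
    """Fit a linear texture model from training vectors (one per row).

    Returns the mean vector and a matrix F whose rows are the n_modes
    principal directions, so that p = F @ (t - mean) as in equation (3).
    """
    textures = np.asarray(textures, dtype=float)
    mean = textures.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(textures - mean, full_matrices=False)
    return mean, vt[:n_modes]


def texture_to_params(F, mean, t):
    """Equation (3): project a texture vector onto the retained modes."""
    return F @ (np.asarray(t, dtype=float) - mean)


# Hypothetical data: 20 training facets, 50 pixels x 3 colour channels.
rng = np.random.default_rng(0)
training = rng.normal(size=(20, 150))
mean, F = fit_texture_model(training, n_modes=5)
params = texture_to_params(F, mean, training[0])
```

Because the rows of F are orthonormal, the product of F with its transpose is the identity matrix, which is the property relied upon when the model is inverted.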

As those skilled in the art will appreciate, for facets 
which describe fairly constant parts of the face, such as 
the chin or cheeks, very few parameters will be needed to 
model the variability within the training images. 
However, facets which are associated with areas of the 
face where there is a large amount of variability (such 
as facets which form part of the eye) will require a 
larger number of facet texture parameters to describe the 
variability within the training images. Therefore, in 
step s65, the system determines how many texture 
parameters are needed for the current facet and stores 
the appropriate facet appearance model matrix. 
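
The embodiment does not state the criterion used in step s65. One common choice, assumed here purely for illustration, is to keep the smallest number of modes which together explain a given fraction of the total variance; the function name and the default fraction are hypothetical.

```python
def modes_needed(variances, fraction=0.95):
    """Return how many modes are needed to explain `fraction` of the variance.

    variances holds the variance explained by each mode. The retained
    fraction is an assumed criterion, not one stated in the embodiment.
    """
    total = sum(variances)
    if total <= 0:
        return 0
    running = 0.0
    for k, var in enumerate(sorted(variances, reverse=True), start=1):
        running += var
        if running >= fraction * total:
            return k
    return len(variances)


# A facet dominated by a few modes needs fewer parameters than a facet
# whose variance is spread evenly (hypothetical variances).
few = modes_needed([5.0, 3.0, 1.0, 1.0], fraction=0.75)
many = modes_needed([1.0, 1.0, 1.0, 1.0], fraction=0.75)
```

With such a criterion, a facet on the cheek would typically retain fewer modes than a facet forming part of the eye.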

In addition to being able to determine a set of texture 
parameters for a given texture vector $t^{iv}$, equation 
(3) can be solved with respect to the texture vector $t^{iv}$ 
to give: 

    $$t^{iv} = \bar{t}^i + F_i^T p_v^{F_i t} \qquad (4)$$

since $F_i F_i^T$ equals the identity matrix. Therefore, by 
modifying the set of texture parameters ($p^{F_i t}$) within 
suitable limits, new textures for facet (i) can be 
generated which are similar to those in the training set. 
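
The relationship between equations (3) and (4) can be checked with a small worked example. The matrix and mean vector below are hypothetical values chosen so that the rows of the matrix are orthonormal; they are not taken from the embodiment.

```python
def project(F, mean, t):
    """Equation (3): p = F (t - mean)."""
    d = [ti - mi for ti, mi in zip(t, mean)]
    return [sum(f * di for f, di in zip(row, d)) for row in F]


def reconstruct(F, mean, p):
    """Equation (4): t = mean + F^T p."""
    return [m + sum(F[k][j] * p[k] for k in range(len(F)))
            for j, m in enumerate(mean)]


# Hypothetical two-mode model of a three-element texture vector.
F = [[0.6, 0.8, 0.0],
     [0.0, 0.0, 1.0]]
mean = [1.0, 1.0, 1.0]

p = [2.0, -1.0]
t = reconstruct(F, mean, p)      # generate a texture from parameters
p_back = project(F, mean, t)     # recover the parameters again
```

Parameters are always recovered from a generated texture. The converse round trip, from an arbitrary texture to parameters and back, is exact only when the texture's deviation from the mean lies within the retained modes; otherwise equation (4) returns the closest texture that the model can represent.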

Once the above procedure has been performed for each of 
the one hundred and forty-eight facets in the training 
images, a facet texture appearance model will have been 
generated for each of those facets. In this embodiment, 
the facet appearance model does not compress the 
parameters defining the shape of the facets, since only 
six parameters are needed to define the shape of each 
facet - two parameters for each (x,y) coordinate of the 
facet's apexes. 

MOUTH APPEARANCE MODEL 

Figure 9 shows a flow chart illustrating the main 
processing steps required in order to generate the mouth 
appearance model 63. As shown, in step s67, the system 
uses the facet appearance models for the facets which 
form part of the mouth to generate shape and texture 
parameters from those facets for each training image. 
Therefore, referring to Figure 10, the mouth appearance 
model 63 will receive texture and shape parameters from 
the facet appearance model for facet (i), facet (j) and 
facet (n) for the corresponding facets in each of the 
training images 79. As illustrated in Figure 10, the 
appearance model for facet (i) is operable to generate, 
for each training image, six shape parameters 
(corresponding to the three (x,y) coordinates of the 
apexes of facet (i)) and six texture parameters. 
Similarly, the appearance model for facet (j) is operable 
to generate, for each training image, six shape 
parameters and four texture parameters and the appearance 
model for facet (n) is operable to generate, for each 
training image, six shape parameters and three texture 
parameters. 

The processing then proceeds to step s69 where the system 
performs a principal component analysis on the shape and 
texture parameters generated for the training images by 
the facet appearance models associated with the mouth. 
In this embodiment, the mouth appearance model 63 treats 
the shape and texture separately. In particular, for 
each training image, the system concatenates the six 
shape parameters of each facet associated with the mouth 
to form a mouth shape vector ($p_v^{FMs}$), 
and concatenates the facet texture parameters output by 
the facet appearance models associated with the mouth to 
form the following texture vector: 

    $$p_v^{FMt} = [p_1^{F_i t}, p_2^{F_i t}, \ldots, p_6^{F_i t}, p_1^{F_j t}, \ldots, p_4^{F_j t}, p_1^{F_n t}, \ldots]^T$$

The system then performs a principal component analysis 
on the shape vectors generated by all the training images 
to generate a shape appearance model for the mouth 
(defined by matrix $M_s$) which relates each mouth shape 
vector to a corresponding vector of mouth shape 
parameters by: 

    $$p_v^{Ms} = M_s (p_v^{FMs} - \bar{p}^{FMs}) \qquad (5)$$

where $p_v^{FMs}$ is the mouth shape vector for the mouth in the 
V-th training image, $\bar{p}^{FMs}$ is the mean mouth shape vector 
from the training vectors and $p_v^{Ms}$ is a vector of mouth 
shape parameters for the mouth shape vector $p_v^{FMs}$. The 
mouth shape model, defined by matrix $M_s$, describes the 
main modes of variation of the shape of the mouths within 
the training images; and the vector of mouth shape 
parameters ($p_v^{Ms}$) for the mouth in the V-th training image 
has a parameter associated with each mode of variation 
whose value relates the shape of the input mouth to the 
corresponding mode of variation. 

As with the facet appearance models, equation (5) above 
can be rewritten with respect to the mouth shape vector 
$p_v^{FMs}$ to give: 

    $$p_v^{FMs} = \bar{p}^{FMs} + M_s^T p_v^{Ms} \qquad (6)$$

since $M_s M_s^T$ equals the identity matrix. Therefore, by 
modifying the mouth shape parameters, new mouth shapes 
can be generated which will be similar to those in the 
training set. 

The system then performs a principal component analysis 
on the mouth texture parameter vectors ($p_v^{FMt}$) which are 
generated for the training images. This principal 
component analysis generates a mouth texture model 
(defined by matrix $M_t$) which relates each of the facet 
texture parameter vectors for the facets associated with 
the mouth, to a corresponding vector of mouth texture 
parameters, by: 

    $$p_v^{Mt} = M_t (p_v^{FMt} - \bar{p}^{FMt}) \qquad (7)$$

where $p_v^{FMt}$ is a vector of mouth facet texture parameters 
generated by the facet appearance models associated with 
the mouth for the mouth in the V-th training image; $\bar{p}^{FMt}$ 
is the mean vector of mouth facet texture parameters from 
the training vectors and $p_v^{Mt}$ is a vector of mouth texture 
parameters for the facet texture parameters $p_v^{FMt}$. The 
matrix $M_t$ describes the main modes of variation within 
the training images of the facet texture parameters 
generated by the facet appearance models which are 
associated with the mouth; and the vector of mouth 
texture parameters ($p_v^{Mt}$) has a parameter associated with 
each of those modes of variation whose value relates the 
texture of the input mouth to the corresponding mode of 
variation. 
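
The second-level analysis of the texture parameters can be sketched as follows. The parameter counts (six, four and three facet texture parameters, and four mouth texture parameters) follow Figure 10 as described; the random data, the variable names and the use of a singular value decomposition are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_images = 30

# Hypothetical facet texture parameters for facets (i), (j) and (n),
# one row per training image.
p_i = rng.normal(size=(n_images, 6))
p_j = rng.normal(size=(n_images, 4))
p_n = rng.normal(size=(n_images, 3))

# One concatenated mouth facet texture vector per training image.
p_fmt = np.hstack([p_i, p_j, p_n])

# Principal component analysis of the concatenated vectors.
mean_fmt = p_fmt.mean(axis=0)
_, _, vt = np.linalg.svd(p_fmt - mean_fmt, full_matrices=False)
M_t = vt[:4]                        # four mouth texture modes

# Equation (7) applied to every training image at once.
p_mt = (p_fmt - mean_fmt) @ M_t.T
```

The mouth shape model of equation (5) could be obtained in the same way from the concatenated facet shape parameters.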

The processing then proceeds to step s71 shown in Figure 
9 where the system determines the number of shape 
parameters and texture parameters needed to describe the 
training data received from the facet appearance models 
which are associated with the mouth. As shown in Figure 
10, in this embodiment, the mouth appearance model 63 
requires five shape parameters and four texture 
parameters to be able to model most of this variation. 
The system therefore stores the appropriate mouth shape 
and texture appearance model matrices for subsequent use. 

As those skilled in the art will appreciate, a similar 
procedure is performed to determine each of the 
appearance models shown in Figure 5, starting from the 
facet appearance models at the base of the hierarchy. A 
further description of how these remaining appearance 
models are determined will, therefore, not be given here. 

The resulting hierarchical appearance model allows a 
small number of global face appearance parameters to be 
input to the face appearance model 61, which generates 
further parameters which propagate down through the 
hierarchical model structure until facet pixel values are 
generated, from which an image which corresponds to the 
global appearance parameters can be generated. 
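
The downward propagation of parameters can be sketched for a single branch of the hierarchy. The sketch below is heavily simplified: it handles texture only, follows one path (face, mouth, facet (i)), uses zero mean vectors and replaces the trained models with random matrices having orthonormal rows. All names and dimensions other than those of Figure 10 are hypothetical.

```python
import numpy as np


def orthonormal_rows(n_rows, n_cols, seed):
    """Random matrix with orthonormal rows, standing in for a trained model."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(n_cols, n_rows)))
    return q.T


def expand(model, mean, params):
    """Inverse mapping of the form of equations (4) and (6)."""
    return mean + model.T @ params


face = orthonormal_rows(3, 4, seed=0)        # global -> mouth texture parameters
mouth = orthonormal_rows(4, 13, seed=1)      # mouth -> facet texture parameters
facet_i = orthonormal_rows(6, 150, seed=2)   # facet (i) -> 50 RGB pixel values

p_global = np.array([0.5, -1.0, 0.25])
p_mouth = expand(face, np.zeros(4), p_global)
p_facets = expand(mouth, np.zeros(13), p_mouth)
pixels_i = expand(facet_i, np.zeros(150), p_facets[:6])
```

In the full hierarchy the same expansion would be repeated for every branch and for the shape parameters, and the pixel values of each facet would be warped into place using the generated apex coordinates.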


AUTOMATIC GENERATION OF APPEARANCE PARAMETERS 

In the description given above of the way in which the 
appearance models are generated, appearance parameters 
for an image were generated from a manual placement of a 
number of landmark points over the image. However, 
during use of the appearance model to track the first 
actor's head in the source video sequence and during the 
calculation of the difference parameters ($p_{dif}$), the 
appearance parameters for the heads in the input images 
were automatically calculated. This task involves 
finding the set of global appearance parameters $p$ which 
best describe the pixels in view. This problem is 
complicated because the inverse of each of the appearance 
models in the hierarchical appearance model is not 
necessarily one-to-one. In this embodiment, the 
appearance parameters for the head in an input image are 
calculated in a two-step process. In the first step, an 
initial set of global appearance parameters for the head 
in the current frame ($I_s^i$) is found using a simple and 
rapid technique. For all but the first frame of the 
source video sequence, this is achieved by simply using 
the appearance parameters from the preceding video frame 
($I_s^{i-1}$) before modification in step s3 (i.e. parameters 
$p_s^{i-1}$). In this embodiment, the global appearance 
parameters ($p$) effectively define the shape and colour 
texture of the head. For the first frame and for the 
target image, the initial estimate of the appearance 
parameters is set to the mean set of appearance 
parameters and the scale, position and orientation are 
initially estimated by the user manually placing the mean 
head over the head in the image. 

In the second step, an iterative technique is used in 
order to make fine adjustments to the initial estimate of 
the appearance parameters. The adjustments are made in 
an attempt to minimise the difference between the head 
described by the global appearance parameters (the model 
head) and the head in the current video frame (the image 
head). With 50 appearance parameters, this represents a 
difficult optimisation problem. This can be performed by 
using a standard steepest descent optimisation technique 
to iteratively reduce the mean squared error between the 
given image pixels and those predicted by a particular 
set of appearance parameter values. In particular, the 
technique minimises the following error function $E(p)$: 

    $$E(p) = [I_a - F(p)]^T [I_a - F(p)] \qquad (8)$$

where $I_a$ is a vector of actual image RGB pixel values at 
the locations where the appearance model predicts values 
(the appearance model does not predict all pixel values 
since it ignores background pixels and only predicts a 
subsample of 4 pixel values within the object being 
modelled) and $F(p)$ is the vector of image RGB pixel 
values predicted by the hierarchical appearance model. 
As those skilled in the art will appreciate, $E(p)$ will 
only be zero when the model head (i.e. $F(p)$) predicts the 
actual image head ($I_a$) exactly. Standard steepest 
descent optimisation techniques stipulate that a step in 
the direction $-\nabla E(p)$ should result in a reduction in the 
error function $E(p)$, provided the error function is well 
behaved. Therefore, the change ($\Delta p$) in the set of 
parameter values should be: 

    $$\Delta p = 2[\nabla F(p)]^T [I_a - F(p)] \qquad (9)$$

which requires the calculation of the differential of the 
appearance model, i.e. $\nabla F(p)$. 
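
The steepest descent update of equation (9) can be illustrated on a toy problem. In the sketch below the appearance model is replaced by a hypothetical linear function F(p) = W p, whose differential is simply W, and the update is scaled by a small step size, which is an added assumption since equation (9) gives only the direction of the step.

```python
def predict(W, p):
    """Toy linear model F(p) = W p standing in for the appearance model."""
    return [sum(w * pj for w, pj in zip(row, p)) for row in W]


def steepest_descent(W, image, p, step=0.05, iterations=500):
    """Repeatedly apply equation (9), scaled by an assumed step size."""
    for _ in range(iterations):
        residual = [a - f for a, f in zip(image, predict(W, p))]
        # Equation (9): delta = 2 [grad F]^T residual, with grad F = W here.
        delta = [2.0 * sum(W[r][j] * residual[r] for r in range(len(W)))
                 for j in range(len(p))]
        p = [pj + step * dj for pj, dj in zip(p, delta)]
    return p


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
image = predict(W, [1.0, -2.0])          # "image" generated by known parameters
estimate = steepest_descent(W, image, [0.0, 0.0])
```

For the real appearance model the differential is not available in closed form and would have to be evaluated at every iteration, which motivates the approximation described next.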

The technique described by Edwards et al assumes that, on 
average over the whole parameter space, $\nabla F(p)$ is 
constant. The update equation then becomes: 

    $$\Delta p = A[I_a - F(p)] \qquad (10)$$

for some constant matrix A (referred to as the "Active 
matrix") which is determined beforehand during a training 
routine. In this embodiment, rather than using a single 
constant matrix associated with the entire hierarchical 
appearance model, an Active matrix is determined and used 
for each of the individual appearance models which form 
part of the hierarchical appearance model. The way in 
which these Active matrices are determined in this 
embodiment will now be described with reference to 
Figures 11a and 11b, which illustrate the processing 
steps performed to generate the Active matrix for each 
facet appearance model and the Active matrix for the 
mouth appearance model. 

As shown in Figure 11a, in step s73, the system chooses 
a random facet parameter vector ($p_0^{F_i}$) for the current 
facet (i) and then, in step s75, perturbs this facet parameter 
vector by a small random amount to create $p_0^{F_i} + \Delta p^{F_i}$. In 
this embodiment, the facet parameter vectors include not 
only the texture parameters, but also the six shape 
parameters which define the (x,y) coordinates of the 
facet's location within the image. The processing then 
proceeds to step s77 where the system uses the parameter 
vector $p_0^{F_i}$ and the perturbed parameter vector $p_0^{F_i} + \Delta p^{F_i}$ to 
create model images $I_0^{F_i}$ and $I_1^{F_i}$ respectively. The 
processing then proceeds to step s79 where the system 
records the parameter change $\Delta p^{F_i}$ and image difference 
$I_1^{F_i} - I_0^{F_i}$. Then in step s81, the system determines whether 
or not there is sufficient training data for the current 
facet. If there is not, then the processing returns to 
step s73. Once sufficient training data has been 
generated, the processing proceeds to step s83 where the 
system performs multiple multivariate linear regressions 
on the data for the current facet to identify an Active 
matrix ($A_{F_i}$) for the current facet. 
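
The training routine of Figure 11a can be sketched as follows. The sketch assumes that the regression is an ordinary least-squares fit, solved here with NumPy's `lstsq`, and it replaces the facet appearance model with a hypothetical linear renderer; the function name, sample count and perturbation scale are likewise assumptions.

```python
import numpy as np


def train_active_matrix(render, n_params, n_samples=200, scale=0.1, seed=0):
    """Estimate A such that delta_p is approximately A @ (I1 - I0)."""
    rng = np.random.default_rng(seed)
    d_params, d_images = [], []
    for _ in range(n_samples):
        p0 = rng.normal(size=n_params)            # random parameter vector
        dp = scale * rng.normal(size=n_params)    # small random perturbation
        d_params.append(dp)                       # record parameter change
        d_images.append(render(p0 + dp) - render(p0))   # and image difference
    d_params = np.array(d_params)
    d_images = np.array(d_images)
    # Least-squares solution X of d_images @ X = d_params; A is X transposed.
    X, *_ = np.linalg.lstsq(d_images, d_params, rcond=None)
    return X.T


# Hypothetical linear renderer: 3 parameters produce 12 pixel values.
W = np.random.default_rng(42).normal(size=(12, 3))
A = train_active_matrix(lambda p: W @ p, n_params=3)
```

For this linear renderer the recovered matrix satisfies A W = I, so a single application of equation (10) would recover any parameter change exactly; for a real, non-linear appearance model the fit is only an average approximation and several iterations are needed.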

Figure 11b shows the processing steps required to 
calculate the Active matrix for the mouth appearance 
model. As shown, in step s85, the system chooses a 
random mouth parameter vector $p_0^M$. In this embodiment, 
this vector includes both the mouth shape parameters and 
the mouth texture parameters. Then, in step s87, the 
system perturbs this mouth parameter vector by a small 
random amount to create $p_0^M + \Delta p^M$. The processing then 
proceeds to step s89 where the system uses the mouth 
parameter vector $p_0^M$ and the perturbed mouth parameter 
vector $p_0^M + \Delta p^M$ to create model images $I_0^M$ and $I_1^M$ 
respectively, using the mouth appearance model and the 
facet appearance models associated with the mouth. The 
processing then proceeds to step s91 where the facet 
appearance models associated with the mouth are used 
again to transform the mouth model images $I_0^M$ and $I_1^M$ into 
corresponding facet appearance parameters $p_0^{FM}$ and $p_1^{FM}$, 
which are then subtracted to determine the corresponding 
change $\Delta p^{FM}$ in the mouth facet parameters. The 
processing then proceeds to step s93 where the system 
records the mouth parameter change $\Delta p^M$ and the mouth 
facet parameter change $\Delta p^{FM}$. The processing then 
proceeds to step s95 where the system determines whether 
or not there is sufficient training data. If there is 
not, then the processing returns to step s85. Once 
sufficient training data has been generated, the 
processing proceeds to step s97, where the system 
performs multiple multivariate linear regressions on the 
training data for the mouth to identify the Active matrix 
($A_M$) for the mouth which relates changes in mouth 
parameters $\Delta p^M$ to changes in facet parameters $\Delta p^{FM}$ for the 
facets associated with the mouth. 

As those skilled in the art will appreciate, a similar 
processing technique is used in order to identify the 
Active matrix for each of the appearance models shown in 
Figure 5. 

Once the Active matrices have been determined for the 
hierarchical appearance model, they can then be used to 
iteratively update a current estimate of a set of 
appearance parameters for an input image. Figure 12 
illustrates the processing steps performed in this 
iterative routine for the current source video frame. As 
shown, in step s101, the system initially estimates a set 
of global parameters for the head in the current source 
video frame. The processing then proceeds to step s103 
where the system generates a model image from the 
estimated global parameters and the hierarchical 
appearance model. The system then proceeds to step s105 
where it determines the image error between the model 
image and the current source video frame. Then, in step 
s107, the system uses this image error to propagate 
parameter changes up the hierarchy of the hierarchical 
appearance model using the stored Active matrices to 
determine a change in the global parameters. This change 
in global parameters is then used, in step s109, to 
update the current global parameters for the current 
source video frame. The system then determines, in step 
s111, whether or not convergence has been reached by 
comparing the error obtained from equation (8) using the 
updated global parameters with a predetermined threshold 
(Th). If convergence has not been reached, then the 
processing returns to step s103. Once convergence is 
reached, the processing proceeds to step s113, where the 
current global appearance parameters are output as the 
global appearance parameters for the current source video 
frame and then the processing ends. 
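
The iterative routine of Figure 12 can be sketched in a flattened form. For brevity the sketch uses a single Active matrix applied directly to the image error, rather than propagating changes up a hierarchy of Active matrices, and it uses a hypothetical, mildly non-linear renderer; the function name, threshold and iteration limit are assumptions.

```python
import numpy as np


def fit_parameters(render, A, image, p_init, threshold=1e-12, max_iter=50):
    """Iterate equation (10) until the error of equation (8) is small."""
    p = np.asarray(p_init, dtype=float)
    for _ in range(max_iter):
        residual = image - render(p)        # image error (step s105)
        p = p + A @ residual                # parameter update (steps s107, s109)
        new_residual = image - render(p)
        if float(new_residual @ new_residual) < threshold:   # step s111
            break
    return p


# Hypothetical renderer: a linear part plus a small non-linear term.
W = np.vstack([np.eye(3)] * 4)              # 12 pixel values from 3 parameters


def render(p):
    linear = W @ p
    return linear + 0.05 * np.sin(linear)


A = W.T / 4.0                               # pseudo-inverse of the linear part
p_true = np.array([0.8, -0.3, 1.2])
image = render(p_true)
p_est = fit_parameters(render, A, image, np.zeros(3))
```

Because the Active matrix only approximates the inverse of the renderer, one update does not reach the solution, but each iteration reduces the remaining parameter error until the threshold test succeeds.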



ALTERNATIVE EMBODIMENTS 

In the above embodiment, the same hierarchical model 
structure was used to model the variation in the shape 
and texture within the training images. As those skilled 
in the art will appreciate, one model hierarchy can be 
used to model the shape variation and a different model 
hierarchy can be used to model the texture variation. 
Alternatively still, rather than separating the shape and 
texture parameters, each of the appearance models within 
the hierarchical model may model the combined variation 
of the shape and texture within the training images. 
In the above embodiments, a facet appearance model was 
generated for each facet defined within the training 
images. As those skilled in the art will appreciate, 
many of the facets may be grouped together such that a 
single facet appearance model is generated for those 
facets. In one form of such an embodiment, a single 
facet appearance model may be determined which models the 
variability of texture within each facet of the training 
images. 

In the above embodiments, the same amount of texture 
information was extracted from each facet within the 
training images. In particular, fifty RGB texture values 
were extracted from each training facet. In an 
alternative embodiment, the amount of texture information 
extracted from each facet may vary in dependence upon the 
size of the facet. For example, more texture information 
may be extracted from larger facets or more texture 
information may be extracted from facets associated with 
important features of the face, such as the mouth and eyes. 



In the above embodiments, each appearance model was 
determined from a principal component analysis of a set 
of training data. This principal component analysis 
determines a linear relationship between the training 
data and a set of model parameters. As those skilled in 
the art will appreciate, techniques other than principal 
component analysis can be used to determine a parametric 
model which relates a set of parameters to the training 
data. This model may define a non-linear relationship 
between the training data and the model parameters. For 
example, one or more of the models within the hierarchy 
may comprise a neural network which relates the set of 
input parameters to the training data. 

In the above embodiments, a principal component analysis 
was performed on a set of training data in order to 
identify a relatively small number of parameters which 
describe the main modes of variation within the training 
data. This allows a relatively small number of input 
parameters to be able to generate a larger set of output 
parameters from the model. However, as those skilled in 
the art will appreciate, this is not essential. One or 
more of the appearance models may act as transformation 
models in which the number of input parameters is the same 
as or greater than the number of output parameters. This 
can be used to generate a set of input parameters which 
can be changed by the user in some intuitive way, for 
example in order to identify parameters which have a 
linear relationship with features in the object, such as 
a parameter that linearly changes the amount of smile 
within a face image. 

In the above embodiments, a set of Active matrices was 
used in order to identify automatically a set of 
appearance parameters for an input image. As those 
skilled in the art will appreciate, rather than having 
separate Active matrices for each of the components in 
the hierarchical appearance model, a global Active matrix 
may be used instead. Further, although both the shape 
and grey level parameters were used in order to derive 
the Active matrices, suitable Active matrices can be 
determined using just the shape information. 

In the above embodiments, the variation in both the shape 
and texture within the training images was modelled. As 
those skilled in the art will appreciate, this 
hierarchical modelling technique can be used to model 
only the shape of the objects within the training images. 
Such a shape model could then be used to track objects 
within a video sequence. 



In the first embodiment, the target image illustrated a 
computer generated head. This is not essential. For 
example, the target image might be a hand-drawn head or 
an image of a real person. Figures 13d and 13e 
illustrate how an embodiment with a hand-drawn character 
might be used in character animation. In particular, 
Figure 13d shows a hand-drawn sketch of a character 
which, when combined with the images from the source 
video sequence (some of which are shown in Figure 13a), 
generates a target video sequence, some frames of which 
are shown in Figure 13e. As can be seen from a 
comparison of the corresponding frames in the source and 
target video frames, the hand-drawn sketch has been 
animated automatically using this technique. As those 
skilled in the art will appreciate, this is a much 
quicker and simpler technique for achieving computer 
animation as compared with existing systems which require 
the animator to manually create each frame of the 
animation. In particular, in this embodiment, all that 
is required is a video sequence of a real life actor 
acting out the scene to be animated, together with a 
single sketch of the character to be animated. 

The above embodiment has described the way in which a 
target image can be used to modify a source video 
sequence. In order to do this, a set of appearance 
parameters has to be automatically calculated for each 
frame in the video sequence. This involved the use of a 
number of Active matrices which relate image errors to 
appearance parameter changes. As those skilled in the 
art will appreciate, similar processing is required in 
other applications, such as the tracking of an object 
within a video sequence, the tracking of a human face 
within a video sequence or the tracking of a knee joint 
in an MRI scan. 

In the above embodiment, the appearance model was used to 
model the variations in facial expressions and 3D pose of 
human heads. As those skilled in the art will 
appreciate, the appearance model can be used to model the 
appearance of any deformable object such as parts of the 
body and other animals and objects. For example, the 
above techniques can be used to track the movement of 
lips in a video sequence. Such an embodiment could be 
used in film dubbing applications in order to synchronise 
the lip movements with the dubbed sound. This animation 
technique might also be used to give animals and other 
objects human-like characteristics by combining images of 
them with a video sequence of an actor. This technique 
can also be used for monitoring the shape and appearance 
of objects passing along a production line for quality 
control purposes. 

In the above embodiment, the appearance model was 
generated by using a principal component analysis of 
shape and texture data which is extracted from the 
training images. As those skilled in the art will 
appreciate, by modelling the features of the training 
heads in this way, it is possible to accurately model 
each head by just a small number of parameters. However, 
other modelling techniques, such as vector quantisation 
and wavelet techniques can be used. 



In the above embodiments, the training images used to 
generate the appearance model were all colour images in 
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which each pixel had an RGB value . As those skilled in 
the art will appreciate, the way in which the colour is 
represented in this embodiment is not important. In 
particular, rather than each pixel having a red, green 
and blue value, they might be represented by a 
chrominance and a luminance component or by hue, 
saturation and value components. Alternatively still, 
the training images may be black and white images, in 
which case only grey level data would be extracted from 
the facets in the training images. Additionally, the 
resolution of each training image may be different. 
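
As a hedged illustration of why the colour representation is not 
important, the following Python sketch builds a texture vector from 
the same pixels in several representations. The function names and 
the "space" argument are hypothetical. The grey-level weights are 
the standard ITU-R BT.601 luma coefficients, and the standard 
library's YIQ conversion stands in here for a generic 
luminance/chrominance representation. 

```python
import colorsys


def to_grey(rgb):
    """Grey level of an RGB pixel (components in 0..1), BT.601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def texture_vector(pixels, space="rgb"):
    """Flatten sampled pixels into one texture vector.

    pixels: list of (r, g, b) tuples with components in 0..1.
    space:  "rgb", "hsv", "yiq" (luminance plus chrominance) or "grey".
    """
    if space == "rgb":
        converted = [tuple(p) for p in pixels]
    elif space == "hsv":
        converted = [colorsys.rgb_to_hsv(*p) for p in pixels]
    elif space == "yiq":
        converted = [colorsys.rgb_to_yiq(*p) for p in pixels]
    elif space == "grey":
        converted = [(to_grey(p),) for p in pixels]
    else:
        raise ValueError("unknown colour space: %r" % (space,))
    # Concatenate the per-pixel components into a single flat vector.
    return [component for pixel in converted for component in pixel]
```

Only the length and meaning of the texture vector change with the 
chosen representation; the subsequent statistical modelling of the 
vectors would proceed in the same way. 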

In the above embodiment, during the automatic generation 
of the appearance parameters, and in particular during 
the iterative updating of these appearance parameters, the 
error between the input image and the model image was 
generated using the appearance model. Since this 
iterative technique still requires a relatively accurate 
initial estimate for the appearance parameters, it is 
possible initially to perform the iterations using lower 
resolution images and once convergence has been reached 
for the lower resolutions to then increase the resolution 
of the images and to repeat the iterations for the higher 
resolutions. In such an embodiment, separate Active 
matrices would be required for each of the resolutions. 
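
The coarse-to-fine scheme described above might be sketched as 
follows. This is an illustrative Python sketch only: the function 
names, the one-dimensional "images", the sign convention (error 
taken as target minus model) and the squared-error tolerance are 
assumptions for the example and are not taken from the embodiment. 

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def refine(params, target, render, active, max_iters=20, tol=1e-9):
    """Iterate at one resolution using that resolution's Active matrix."""
    for _ in range(max_iters):
        error = [t - m for t, m in zip(target, render(params))]
        if sum(e * e for e in error) < tol:
            break
        # The Active matrix maps the image error to a parameter change.
        change = matvec(active, error)
        params = [p + dp for p, dp in zip(params, change)]
    return params


def fit_coarse_to_fine(levels, params):
    """levels: (target, render, active) tuples, coarsest resolution first."""
    for target, render, active in levels:
        params = refine(params, target, render, active)
    return params
```

The parameters found at each lower resolution serve as the initial 
estimate for the next higher resolution, and each level carries its 
own Active matrix because the error vector has a different length 
at each resolution. 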

In the above embodiment, the difference parameters were 
determined by comparing the image of the first actor from 
one of the frames of the source video sequence with the 
image of the second actor in the target image. In an 
alternative embodiment, a separate image of the first 
actor may be provided which does not form part of the 
source video sequence. 

In the above embodiments, each of the appearance models 
modelled variations in two-dimensional images. The above 
modelling technique could be adapted to work with 3D 
images and animations. In such an embodiment, the 
training images used to generate the appearance model 
would normally include 3D images instead of 2D images. 
The three-dimensional models may be obtained using a 
three-dimensional scanner, which typically works either by 
using laser range-finding over the object or by using one 
or more stereo pairs of cameras. Once a 3D hierarchical 
appearance model has been created from the training 
models, new 3D models can be generated by adjusting the 
appearance parameters and existing 3D models can be 
animated using the same differencing technique that was 
used in the two-dimensional embodiment described above. 
This 3D model can then be used to track 3D objects 
directly within a 3D animation. Alternatively, a 2D 
model may be used to track the 3D object within a video 
sequence and the result then used to generate 3D data for 
the tracked object. 

In the above embodiment, a set of difference parameters 
was identified which describes the main differences 
between the head in the video sequence and the head in 
the target image, which difference parameters were used 
to modify the video sequence so as to generate a target 
video sequence showing the second head. In the 
embodiment, the set of difference parameters was added 
to a set of appearance parameters for the current frame 
being processed. In an alternative embodiment, the 
difference parameters may be weighted so that, for 
example, the target video sequence shows a head having 
characteristics from both the first and second actors. 
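
A minimal Python sketch of this weighting is given below for 
illustration. The function names are hypothetical, and the sketch 
assumes that the difference parameters are formed as the target 
parameters minus the source parameters and that a single scalar 
weight is applied to all of them; a separate weight per parameter 
would be an equally plausible reading. 

```python
def difference_parameters(source_params, target_params):
    """Difference between the first actor's and second actor's parameters."""
    return [t - s for s, t in zip(source_params, target_params)]


def apply_difference(frame_params, diff, weight=1.0):
    """Add the (optionally weighted) difference to one frame's parameters.

    weight=1.0 applies the full difference, as in the main embodiment;
    a weight between 0 and 1 gives a blend of the two actors.
    """
    return [p + weight * d for p, d in zip(frame_params, diff)]
```

For example, a weight of 0.5 would place each frame's parameters 
halfway between those of the first actor and those that the full 
difference would produce. 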

In the above embodiment, a hierarchical appearance model 
is used to model the appearance of human faces. The 
model is then used to modify a source video sequence 
showing a first actor performing a scene to generate a 
target video sequence showing a second actor performing 
the same scene. As those skilled in the art will 
appreciate, the hierarchical model presented above can be 
used in various other applications. For example, the 
hierarchical appearance model can be used for synthetic 
two-dimensional or three-dimensional character 
generation; video compression when the video is 
substantially that of an object which is modelled by the 
appearance model; object recognition for security 
purposes; face tracking for human performance analysis or 
human computer interaction and the like; 3D model 
generation from two-dimensional images; and image editing 
(for example making people look older or younger, fatter 
or thinner etc.). 

In the above embodiment, an iterative process was used to 
update an estimated set of appearance parameters for an 
input image. This iterative process continued until an 
error between the actual image and the image predicted by 
the model was below a predetermined threshold. In an 
alternative embodiment, where there is only a 
predetermined amount of time available for determining a 
set of appearance parameters for an input image, this 
iterative routine may be performed for a predetermined 
period of time or for a predetermined number of 
iterations. 
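
The alternative stopping criteria can be illustrated with the 
following Python sketch. It is a sketch under stated assumptions 
rather than the routine of the embodiment: fit_parameters, 
update_step and error_of are hypothetical names, the update step is 
supplied by the caller, and the error is assumed to be a single 
non-negative number. 

```python
import time


def fit_parameters(params, update_step, error_of,
                   threshold=None, max_iters=None, time_budget=None):
    """Iterate until any enabled stopping criterion is met.

    threshold:   stop once the error falls below this value.
    max_iters:   stop after this many updates.
    time_budget: stop after this many seconds.
    Returns the final parameters and the reason for stopping.
    """
    if threshold is None and max_iters is None and time_budget is None:
        raise ValueError("at least one stopping criterion is required")
    start = time.monotonic()
    iterations = 0
    while True:
        if threshold is not None and error_of(params) < threshold:
            return params, "converged"
        if max_iters is not None and iterations >= max_iters:
            return params, "iteration limit"
        if (time_budget is not None
                and time.monotonic() - start >= time_budget):
            return params, "time limit"
        params = update_step(params)
        iterations += 1
```

Combining a threshold with an iteration or time limit gives a 
routine that stops early when the fit is good but never exceeds the 
processing time available for an input image. 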

1. A parametric model for modelling the shape of an 
object, the model comprising: 

data defining a function which relates a set of 
input parameters to a set of locations which identify the 
relative positions of a plurality of predetermined points 
on the object; 

characterised in that said data defines a 
hierarchical set of functions in which a function in a 
top layer of the hierarchy is operable to generate a set 
of output parameters from a set of input parameters and 
in which one or more functions in a bottom layer of the 
hierarchy are operable to receive parameters output from 
one or more functions from a higher layer of the 
hierarchy and to generate therefrom at least some of said 
locations which identify the relative positions of said 
predetermined points. 

2. A model according to claim 1, wherein said hierarchy 
comprises one or more intermediate layers of functions 
which are operable to receive parameters output from one 
or more functions from a higher layer of the hierarchy 
and to generate therefrom a set of output parameters for 
input to functions in a lower layer of the hierarchy. 

3. A model according to claim 1 or 2, for modelling the 
two-dimensional shape of the object by identifying the 
relative positions of said predetermined points in a 
predetermined plane. 

4. A model according to claim 1 or 2, for modelling the 
three-dimensional shape of the object by identifying the 
relative positions of the predetermined points in a 
three-dimensional space. 

5. A model according to any preceding claim, wherein 
one or more of said functions comprises a linear function 
which linearly relates the input parameters to the 
function to the output parameters of the function. 

6. A model according to claim 5, wherein said one or 
more linear functions are identified from a principal 
component analysis of training data derived from a set of 
training objects. 

7. A model according to any preceding claim, wherein 
one or more of said functions are non-linear. 

8. A model according to claim 7, wherein at least one 
of said non-linear functions comprises a neural network. 

9. A model according to any preceding claim, wherein 
the number of parameters input to at least one of said 
functions is smaller than the number of parameters output 
from the function. 

10. A model according to any preceding claim, wherein 
the number of input parameters to at least one of said 
functions is greater than or equal to the number of 
parameters output by the function. 

11. A model according to any preceding claim for 
modelling the shape and texture of the object, the model 
further comprising data defining a hierarchical set of 
functions in which a function in a top layer of the 
hierarchy is operable to generate a set of output 
parameters from a set of input parameters and in which 
one or more functions in a bottom layer of the hierarchy 
are operable to receive parameters output from one or 
more functions from a higher layer of the hierarchy and 
to generate therefrom texture information for the object. 

12. A model according to claim 11, wherein the texture 
hierarchy has the same structure as the shape hierarchy. 

13. A model according to claim 11 or 12, wherein one or 
more of said functions are operable to relate an input 
set of shape and texture parameters to an output set of 
appearance parameters defining both shape and texture. 

14. A model according to any preceding claim, wherein 
said object is a deformable object. 



15. A model according to claim 14, wherein said 
deformable object includes a human face. 

16. A model according to claim 15, wherein said function 
in said top layer of the hierarchy models the shape of 
the entire face and wherein said hierarchy includes a 
function which models the shape of the mouth. 

17. A model according to claim 16, wherein said 
hierarchy further comprises a function for modelling the 
shape of the eyes. 

18. A model according to any preceding claim, wherein 
the or each function in the bottom layer of the hierarchy 
identifies the positions of a plurality of predetermined 
points according to a predefined function of a smaller 
number of control point positions. 

19. A model according to claim 18, wherein the 
predefined function for each of the plurality of points 
is a linear mapping of the control point positions and 
the control points are the three corners of a triangular 
facet. 

20. A model according to claim 18, wherein the 
predefined function for each of the plurality of points 
is a predefined non-linear mapping of a fixed number of 
control point positions. 



21. A model according to claim 18, wherein the 
predefined function for each of the plurality of points 
is a predefined displacement from a single control point. 

22. A method of determining a set of appearance 
parameters representative of the appearance of an object, 
the method comprising the steps of: 

(i) storing a parametric model according to any of 
claims 1 to 21 which relates a set of input parameters to 
appearance data representative of the appearance of the 
object; 

(ii) storing at least one function which relates a 
change in the input parameters to an error between actual 
appearance data for the object and appearance data 
determined from the set of input parameters and said 
parametric model; 

(iii) initially estimating a current set of input 
parameters for the object; 

(iv) determining appearance data for the object from 
the current set of input parameters and the stored 
parametric model; 

(v) determining the error between actual appearance 
data of the object and the appearance data determined 
from the current set of input parameters; 

(vi) determining a change in the input parameters 
using said at least one stored function and said 
determined error; and 

(vii) updating the current set of input parameters 
with the determined change in the input parameters. 

23. A method according to claim 22, further comprising 
the step of repeating steps (iv) to (vii) until the error 
determined in step (v) is less than a predetermined 
threshold. 

24. A method according to claim 22, further comprising 
the step of repeating steps (iv) to (vii) for a 
predetermined amount of time or for a predetermined 
number of repetitions. 

25. A method according to claim 22, 23 or 24, wherein 
said second storing step stores a plurality of functions, 
one associated with each function within the hierarchical 
model. 

26. A method of tracking an object comprising the steps 
of: 

(i) storing a parametric model according to any of 
claims 1 to 21 which relates a set of input parameters to 
appearance data representative of the appearance of the 
object; 

(ii) storing at least one function which relates a 
change in the input parameters to an error between the 
actual appearance data for the object and the appearance 
data determined from the set of input parameters and said 
parametric model; 

(iii) initially estimating a current set of input 
parameters for the object; 

(iv) determining the appearance data for the object 
from the current set of input parameters and the stored 
parametric model; 

(v) determining an error between the actual 
appearance data for the object and the appearance data 
for the object determined from the current set of input 
parameters; 

(vi) determining a change in the input parameters 
using the at least one stored function and the determined 
error; 

(vii) updating the current set of input parameters 
with said change in the input parameters; 

(viii) repeating steps (iv) to (vii) in order to 
reduce the error determined in step (v); and 

(ix) repeating steps (iii) to (viii) to track the 
object. 

27. An apparatus for determining a set of appearance 
parameters representative of the appearance of an object, 
the apparatus comprising: 

means for storing (i) a parametric model according 
to any of claims 1 to 21 which relates a set of input 
parameters to appearance data representative of the 
appearance of the object; and (ii) at least one function 
which relates a change in the input parameters to an 
error between actual appearance data for the object and 
the appearance data for the object determined from the 
set of input parameters and said parametric model; 

means for receiving an initial estimate of a current 
set of input parameters for the object; 

means for updating the current set of input 
parameters comprising: 

(i) means for determining appearance data for the 
object from the current set of input parameters and the 
stored parametric model; 

(ii) means for determining the error between the 
actual appearance data for the object and the appearance 
data for the object determined from the current set of 
input parameters; 

(iii) means for determining a change in the input 
parameters using said at least one stored function and 
said determined error; and 

(iv) means for updating the current set of input 
parameters with the determined change in the input 
parameters. 

28. An apparatus according to claim 27, wherein said 
updating means is operable to update iteratively the 
current set of input parameters until the error 
determining means determines an error which is less than 
a predetermined threshold. 

29. An apparatus according to claim 27 or 28, wherein 
said storing means stores a plurality of functions, one 
associated with each function within the hierarchical 
model. 

30. An apparatus for tracking an object comprising: 

means for storing (i) a parametric model according 
to any of claims 1 to 21 which relates a set of input 
parameters to appearance data representative of the 
appearance of the object; and (ii) at least one function 
which relates a change in the input parameters to an 
error between actual appearance data for the object and 
the appearance data for the object determined from the 
set of input parameters and said parametric model; 

means for receiving an initial estimate of a current 
set of input parameters for the object; 

means for updating the current set of input 
parameters comprising: 

(i) means for determining appearance data for the 
object from the current set of input parameters and the 
stored parametric model; 

(ii) means for determining an error between actual 
appearance data for the object and the appearance data 
for the object determined from the current set of input 
parameters; 

(iii) means for determining a change in the input 
parameters using the at least one stored function and the 
determined error; and 

(iv) means for updating the current set of input 
parameters with said change in the input parameters; 

wherein said updating means is operable to update 
iteratively the current set of input parameters in order 
to reduce the determined error, wherein said receiving 
means is operable to receive further estimates of the 
current input parameters and wherein said update means is 
operable to update the received estimates of the current 
input parameters in order to track said object. 

31. A storage medium storing the parametric model 
according to any of claims 1 to 21 or storing processor 
implementable instructions for controlling a processor to 
implement the method of any one of claims 22 to 26. 

32. Processor implementable instructions for controlling 
a processor to implement the method of any one of claims 
22 to 26. 